Date:      Thu, 11 Oct 2018 12:59:41 +1300
From:      Thomas Munro <munro@ip9.org>
To:        freebsd-hackers@freebsd.org
Cc:        alc@freebsd.org, mjg@freebsd.org, Konstantin Belousov <kib@freebsd.org>
Subject:   PostgreSQL vs super pages
Message-ID:  <CADLWmXU=7QM-oHmY=TMAQanQE-dnXY4v74Zm1kkEz3Gc=ip21A@mail.gmail.com>

Hello FreeBSD hackers,

In version 11 of PostgreSQL (about to escape) we've introduced shared
memory-based parallel hash joins.  Academic reports and basic
intuition tell me that big hash tables should benefit from super pages
due to random access.  I'm interested in exploring this effect on
FreeBSD.  How can I encourage super pages in short-lived mappings,
given our access pattern?

I don't have a good understanding of virtual memory (though I'm trying
to learn), but let me explain what we're doing, and what I see so far
on FreeBSD.  We have two kinds of shared memory, and here I'm
interested in the second one:

1.  We have a big fixed-size area that lives as long as the database
is up, holding our buffer pool and other shared state.  It is
inherited by every PostgreSQL process.  Its size is set in
postgresql.conf with e.g. shared_buffers = '1GB'.  On any long-running,
not-completely-idle system I eventually see that it is using super
pages (though I wish procstat -v would tell me how many):

  752        0x802e00000        0x80bb9e000 rw- 3730 3730   5   0 --S- df

That's cool, and we know from benchmarks and experience on Linux and
Windows (where we explicitly request huge/large pages with MAP_HUGETLB
and MEM_LARGE_PAGES respectively) that this has beneficial performance
effects, so I'm happy that FreeBSD eventually reaches that state too
(though I haven't yet grokked exactly how and when that happens or
attempted to measure its impact on FreeBSD).  No problem here AFAIK.

2.  In recent versions of PostgreSQL we got parallel computing fever,
and decided we needed more dynamic shared memory for that.  So we
create chunks of memory with shm_open(), size them with ftruncate()
and then map them into our main process + worker processes (yeah,
still no threads after 30 years).  This is where parallel hash join
data goes.

To get rid of obvious obstacles to super page promotion, I patched my
local copy of PostgreSQL to make sure that we always ask for multiples
of 2MB and set MAP_ALIGNED_SUPER, but still no cigar (or not much
cigar, anyway).  Here's what I see on my super slow laptop running
recent HEAD.  I'll put repro instructions below in case anyone is
interested.  I ran a 3-process ~380MB join that takes ~90s.  These
mappings appeared in my memory:

18288        0x826600000        0x82a621000 rw- 16385 16385   4   0 ---- df
18288        0x82a800000        0x82ac00000 rw- 1024 1024   4   0 ---- df
18288        0x82ac00000        0x82b000000 rw- 1024 1024   4   0 ---- df
18288        0x82b000000        0x82b400000 rw- 1024 1024   4   0 ---- df
18288        0x82b400000        0x82b800000 rw- 1024 1024   4   0 ---- df
18288        0x82b800000        0x82c000000 rw- 2048 2048   4   0 --S- df
18288        0x82c000000        0x82c800000 rw- 2048 2048   4   0 ---- df
18288        0x82c800000        0x82d000000 rw- 2048 2048   4   0 ---- df
18288        0x82d000000        0x82d800000 rw- 2048 2048   4   0 ---- df
18288        0x82d800000        0x82e800000 rw- 4096 4096   4   0 ---- df
18288        0x82e800000        0x82f800000 rw- 4096 4096   4   0 ---- df
18288        0x82f800000        0x830800000 rw- 4096 4096   4   0 ---- df
18288        0x830800000        0x831800000 rw- 4096 4096   4   0 ---- df
18288        0x831800000        0x833800000 rw- 8192 8192   4   0 --S- df
18288        0x833800000        0x835800000 rw- 8192 8192   4   0 ---- df
18288        0x835800000        0x837800000 rw- 8192 8192   4   0 ---- df
18288        0x837800000        0x839800000 rw- 8192 8192   4   0 ---- df
18288        0x839800000        0x83d800000 rw- 16102 16102   4   0 ---- df

That's actually the best case I've seen, with two S entries.  Usually
there are none, and sometimes just one.  The big mapping at the top
holds the hash table buckets, and I've never seen an S there.  The
rest of them hold tuples.

Looking at the output of sysctl vm.pmap before and after a run, I saw:

vm.pmap.ad_emulation_superpage_promotions: 0
vm.pmap.num_superpage_accessed_emulations: 0
vm.pmap.num_accessed_emulations: 0
vm.pmap.num_dirty_emulations: 0
vm.pmap.pdpe.demotions: no change
vm.pmap.pde.promotions: +20
vm.pmap.pde.p_failures: +1
vm.pmap.pde.mappings: no change
vm.pmap.pde.demotions: +48
vm.pmap.pcid_save_cnt: 21392597
vm.pmap.pti: 1
vm.pmap.invpcid_works: 1
vm.pmap.pcid_enabled: 1
vm.pmap.pg_ps_enabled: 1
vm.pmap.pat_works: 1

With the attached patch, the syscalls look like this in truss in the
backend that creates each shm segment:

shm_open("/PostgreSQL.1721888107",O_RDWR|O_CREAT|O_EXCL,0600) = 46 (0x2e)
ftruncate(46,0x400000)                           = 0 (0x0)
mmap(0x0,4194304,PROT_READ|PROT_WRITE,MAP_SHARED|MAP_HASSEMAPHORE|MAP_NOSYNC|MAP_ALIGNED_SUPER,46,0x0) = 35081158656 (0x82b000000)
close(46)                                        = 0 (0x0)

... and like this in other backends that map them in:

shm_open("/PostgreSQL.1214430502",O_RDWR,0600)   = 46 (0x2e)
fstat(46,{ mode=-rw------- ,inode=20,size=8388608,blksize=4096 }) = 0 (0x0)
mmap(0x0,8388608,PROT_READ|PROT_WRITE,MAP_SHARED|MAP_HASSEMAPHORE|MAP_NOSYNC|MAP_ALIGNED_SUPER,46,0x0) = 35110518784 (0x82cc00000)
close(46)                                        = 0 (0x0)

The access pattern for the memory is as follows:

1.  In the "build" phase we first initialise the bucket segment with
zeroes (sequential), and then load all the tuples into the other
segments (sequential) and insert them into the buckets (random,
compare-and-swap).  We add more segments as necessary, gradually
cranking up the sizes.

2.  In the "probe" phase, all access is read only.  We probe the
buckets (random) and follow pointers to tuples in the other segments
(random).

Afterwards we unmap them and shm_unlink() them, and the parallel
worker processes exit.  It's possible that we'll want to recycle
memory segments and worker processes in the future, but I thought I'd
point out that we don't do that today, in case it's relevant.

I understand that the philosophy is not to provide explicit control
over page size.  That makes sense, but I'd be grateful for any tips on
how to encourage super pages for this use case.

Thanks,

Thomas Munro

====

How to see this (assuming you have git, gmake, flex, bison, readline,
curl, ccache):

# grab postgres
git clone https://github.com/postgres/postgres.git
cd postgres

# you might want to apply the attached patch to get aligned segments
patch -p1 < super-aligned.patch

# build
./configure --prefix=$HOME/install --with-includes=/usr/local/include \
  --with-libs=/usr/local/lib CC="ccache cc"
gmake -s -j8
gmake -s install
gmake -C contrib/pg_prewarm install

# create a db cluster
~/install/bin/initdb -D ~/pgdata
echo "shared_buffers = '1GB'" >> ~/pgdata/postgresql.conf

# you can either start (and later stop) postgres in the background with pg_ctl:
~/install/bin/pg_ctl start -D ~/pgdata
# ... or just run it in the foreground and hit ^C to stop it:
# ~/install/bin/postgres -D ~/pgdata

# run the psql shell
~/install/bin/psql postgres

# inside psql, find your backend's pid
# (you can also find the parallel workers with top, but they come and
# go with each query)
select pg_backend_pid();

# create a table and set memory size to avoid more complicated
# batching behaviour
create table t as select generate_series(1, 8000000)::int i;
analyze t;
set work_mem = '1GB';

# if for some reason you want to change the number of parallel workers, try:
# set max_parallel_workers_per_gather = 2;

# this is quite handy for removing all disk IO from the picture
create extension pg_prewarm;
select pg_prewarm('t'::regclass);

# run a toy parallel hash join
explain analyze select count(*) from t t1 join t t2 using (i);

In procstat -v you should see that it spends about half of its time
"building", which looks like slowly adding new mappings and touching
more and more pages, and then about half of its time "probing", where
there are no further changes visible in procstat -v.  If your results
are like mine, only after building will you see any S mappings appear,
and then only rarely.

==== attachment: super-aligned.patch ====

diff --git a/src/backend/storage/ipc/dsm_impl.c b/src/backend/storage/ipc/dsm_impl.c
index 70f899e765..4af16381e9 100644
--- a/src/backend/storage/ipc/dsm_impl.c
+++ b/src/backend/storage/ipc/dsm_impl.c
@@ -377,6 +377,9 @@ dsm_impl_posix(dsm_op op, dsm_handle handle, Size request_size,
 
 	/* Map it. */
 	address = mmap(NULL, request_size, PROT_READ | PROT_WRITE,
+#ifdef MAP_ALIGNED_SUPER
+				   MAP_ALIGNED_SUPER |
+#endif
 				   MAP_SHARED | MAP_HASSEMAPHORE | MAP_NOSYNC, fd, 0);
 	if (address == MAP_FAILED)
 	{
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index 3b7e2ad126..ae5d3be290 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -63,9 +63,10 @@
  * The size of the initial DSM segment that backs a dsa_area created by
  * dsa_create.  After creating some number of segments of this size we'll
  * double this size, and so on.  Larger segments may be created if necessary
- * to satisfy large requests.
+ * to satisfy large requests.  2MB is chosen to match the size of huge pages
+ * on common hardware (though not many OSes currently try to use them).
  */
-#define DSA_INITIAL_SEGMENT_SIZE ((size_t) (1 * 1024 * 1024))
+#define DSA_INITIAL_SEGMENT_SIZE ((size_t) (2 * 1024 * 1024))
 
 /*
  * How many segments to create before we double the segment size.  If this is
@@ -2090,9 +2091,8 @@ make_new_segment(dsa_area *area, size_t requested_pages)
 	 * we wouldn't need to use FreePageManager).
 	 *
 	 * We decide on a total segment size first, so that we produce tidy
-	 * power-of-two sized segments.  This is a good property to have if we
-	 * move to huge pages in the future.  Then we work back to the number of
-	 * pages we can fit.
+	 * power-of-two sized segments to align with operating system pages.
+	 * Then we work back to the number of pages we can fit.
 	 */
 	total_size = DSA_INITIAL_SEGMENT_SIZE *
 		((size_t) 1 << (new_index / DSA_NUM_SEGMENTS_AT_EACH_SIZE));


