From: Thomas Munro
Date: Thu, 11 Oct 2018 12:59:41 +1300
Subject: PostgreSQL vs super pages
To: freebsd-hackers@freebsd.org
Cc: alc@freebsd.org, mjg@freebsd.org, Konstantin Belousov

Hello FreeBSD hackers,

In version 11 of PostgreSQL (about to escape) we've introduced shared
memory-based parallel hash joins.  Academic reports and basic intuition
tell me that big hash tables should benefit from super pages due to
random access.  I'm interested in exploring this effect on FreeBSD.
How can I encourage super pages in short-lived mappings, given our
access pattern?  I don't have a good understanding of virtual memory
(though I'm trying to learn), but let me explain what we're doing, and
what I see so far on FreeBSD.
We have two kinds of shared memory, and here I'm interested in the second
one:

1.  We have a big fixed-size area that lives as long as the database is up,
    that holds our buffer pool and other shared state.  It is inherited by
    every PostgreSQL process.  Its size is set in postgresql.conf with e.g.
    shared_buffers = '1GB'.  On any long-running, not-completely-idle system
    I eventually see that it is using super pages (though I wish procstat -v
    would tell me how many):

      752 0x802e00000 0x80bb9e000 rw- 3730 3730 5 0 --S- df

    That's cool, and we know from benchmarks and experience on Linux and
    Windows (where we explicitly request huge/large pages with MAP_HUGETLB
    and MEM_LARGE_PAGES respectively) that this has beneficial performance
    effects, so I'm happy that FreeBSD eventually reaches that state too
    (though I haven't yet grokked exactly how and when that happens or
    attempted to measure its impact on FreeBSD).  No problem here AFAIK.

2.  In recent versions of PostgreSQL we got parallel computing fever, and
    decided we needed more dynamic shared memory for that.  So we create
    chunks of memory with shm_open(), size them with ftruncate() and then
    map them into our main process + worker processes (yeah, still no
    threads after 30 years).  This is where parallel hash join data goes.

To get rid of obvious obstacles to super page promotion, I patched my local
copy of PostgreSQL to make sure that we always ask for multiples of 2MB and
set MAP_ALIGNED_SUPER, but still no cigar (or not much cigar, anyway).

Here's what I see on my super slow laptop running recent HEAD.  I'll put
repro instructions below in case anyone is interested.  I ran a 3-process
~380MB join that takes ~90s.  These mappings appeared in my memory:

18288 0x826600000 0x82a621000 rw- 16385 16385 4 0 ---- df
18288 0x82a800000 0x82ac00000 rw-  1024  1024 4 0 ---- df
18288 0x82ac00000 0x82b000000 rw-  1024  1024 4 0 ---- df
18288 0x82b000000 0x82b400000 rw-  1024  1024 4 0 ---- df
18288 0x82b400000 0x82b800000 rw-  1024  1024 4 0 ---- df
18288 0x82b800000 0x82c000000 rw-  2048  2048 4 0 --S- df
18288 0x82c000000 0x82c800000 rw-  2048  2048 4 0 ---- df
18288 0x82c800000 0x82d000000 rw-  2048  2048 4 0 ---- df
18288 0x82d000000 0x82d800000 rw-  2048  2048 4 0 ---- df
18288 0x82d800000 0x82e800000 rw-  4096  4096 4 0 ---- df
18288 0x82e800000 0x82f800000 rw-  4096  4096 4 0 ---- df
18288 0x82f800000 0x830800000 rw-  4096  4096 4 0 ---- df
18288 0x830800000 0x831800000 rw-  4096  4096 4 0 ---- df
18288 0x831800000 0x833800000 rw-  8192  8192 4 0 --S- df
18288 0x833800000 0x835800000 rw-  8192  8192 4 0 ---- df
18288 0x835800000 0x837800000 rw-  8192  8192 4 0 ---- df
18288 0x837800000 0x839800000 rw-  8192  8192 4 0 ---- df
18288 0x839800000 0x83d800000 rw- 16102 16102 4 0 ---- df

That's actually the best case I've seen, with two S.  Usually there are no
S mappings at all, and sometimes just one.  The big mapping at the top holds
the hash table buckets, and I've never seen an S there.  The rest of them
hold tuples.
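In case anyone wants to poke at this outside of PostgreSQL, here is a minimal
standalone sketch of what the patched code effectively does when it creates
and maps one of these segments.  The segment name and the fixed 2MB size are
made up for illustration; the real code is in
src/backend/storage/ipc/dsm_impl.c and asks for multiples of 2MB as mentioned
above.

#include <sys/mman.h>
#include <sys/stat.h>
#include <err.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	const char *name = "/superpage-test";	/* made-up name */
	size_t size = 2 * 1024 * 1024;		/* one super page worth */
	int fd;
	char *p;

	/* Create and size the segment, as dsm_impl_posix() does. */
	fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, 0600);
	if (fd == -1)
		err(1, "shm_open");
	if (ftruncate(fd, (off_t) size) == -1)
		err(1, "ftruncate");

	/* Map it, asking for super page alignment where the flag exists. */
	p = mmap(NULL, size, PROT_READ | PROT_WRITE,
#ifdef MAP_ALIGNED_SUPER
		 MAP_ALIGNED_SUPER |
#endif
		 MAP_SHARED | MAP_HASSEMAPHORE | MAP_NOSYNC, fd, 0);
	if (p == MAP_FAILED)
		err(1, "mmap");
	close(fd);
	shm_unlink(name);	/* the mapping itself stays alive */

	/* Touch every page so there is something to promote. */
	memset(p, 0x2a, size);

	pause();		/* park here so procstat -v can look at us */
	return 0;
}

With that parked you can watch the mapping in procstat -v and the counters
under sysctl vm.pmap, the same way as described below for the real thing.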
Looking at the output of sysctl vm.pmap before and after a run, I saw:

vm.pmap.ad_emulation_superpage_promotions: 0
vm.pmap.num_superpage_accessed_emulations: 0
vm.pmap.num_accessed_emulations: 0
vm.pmap.num_dirty_emulations: 0
vm.pmap.pdpe.demotions: no change
vm.pmap.pde.promotions: +20
vm.pmap.pde.p_failures: +1
vm.pmap.pde.mappings: no change
vm.pmap.pde.demotions: +48
vm.pmap.pcid_save_cnt: 21392597
vm.pmap.pti: 1
vm.pmap.invpcid_works: 1
vm.pmap.pcid_enabled: 1
vm.pmap.pg_ps_enabled: 1
vm.pmap.pat_works: 1

With the attached patch, the syscalls look like this in truss in the backend
that creates each shm segment:

shm_open("/PostgreSQL.1721888107",O_RDWR|O_CREAT|O_EXCL,0600) = 46 (0x2e)
ftruncate(46,0x400000) = 0 (0x0)
mmap(0x0,4194304,PROT_READ|PROT_WRITE,MAP_SHARED|MAP_HASSEMAPHORE|MAP_NOSYNC|MAP_ALIGNED_SUPER,46,0x0) = 35081158656 (0x82b000000)
close(46) = 0 (0x0)

... and like this in other backends that map them in:

shm_open("/PostgreSQL.1214430502",O_RDWR,0600) = 46 (0x2e)
fstat(46,{ mode=-rw------- ,inode=20,size=8388608,blksize=4096 }) = 0 (0x0)
mmap(0x0,8388608,PROT_READ|PROT_WRITE,MAP_SHARED|MAP_HASSEMAPHORE|MAP_NOSYNC|MAP_ALIGNED_SUPER,46,0x0) = 35110518784 (0x82cc00000)
close(46) = 0 (0x0)

The access pattern for the memory is as follows:

1.  In the "build" phase we first initialise the bucket segment with zeroes
    (sequential), and then load all the tuples into the other segments
    (sequential) and insert them into the buckets (random, compare-and-swap;
    there's a little sketch of this below).  We add more segments as
    necessary, gradually cranking up the sizes.

2.  In the "probe" phase, all access is read only.  We probe the buckets
    (random) and follow pointers to tuples in the other segments (random).

Afterwards we unmap them and shm_unlink() them, and the parallel worker
processes exit.  It's possible that we'll want to recycle memory segments
and worker processes in future, but I thought I'd point out that we don't
do that today, in case it's relevant.
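To make that access pattern concrete, here is a toy single-process model of
the build and probe phases.  The names and data structures are simplified and
made up for illustration; the real code lives in
src/backend/executor/nodeHash.c and works in terms of dsa_pointer offsets
into the shm segments above rather than pointers into private malloc'd
memory.

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* A tuple that can be chained into a bucket's linked list. */
typedef struct tuple
{
	struct tuple *next;
	uint32_t	hash;
	/* ... payload would follow ... */
} tuple;

typedef _Atomic(tuple *) bucket;

static void
insert_tuple(bucket *buckets, size_t nbuckets, tuple *t)
{
	/* Random bucket choice: the access that super pages should help. */
	bucket	   *b = &buckets[t->hash % nbuckets];
	tuple	   *head = atomic_load(b);

	/* Lock-free push onto the bucket's chain with compare-and-swap. */
	do
	{
		t->next = head;
	} while (!atomic_compare_exchange_weak(b, &head, t));
}

static long
probe_bucket(bucket *buckets, size_t nbuckets, uint32_t hash)
{
	long		matches = 0;

	/* Read-only random access: the bucket, then the tuples it points to. */
	for (tuple *t = atomic_load(&buckets[hash % nbuckets]); t; t = t->next)
		if (t->hash == hash)
			matches++;
	return matches;
}

int
main(void)
{
	size_t		nbuckets = 1024 * 1024;
	size_t		ntuples = 1024 * 1024;
	bucket	   *buckets = malloc(nbuckets * sizeof(*buckets));
	tuple	   *tuples = malloc(ntuples * sizeof(*tuples));
	long		total = 0;

	/* Build: zero the buckets (sequential), then insert tuples (random). */
	for (size_t i = 0; i < nbuckets; i++)
		atomic_init(&buckets[i], NULL);
	for (size_t i = 0; i < ntuples; i++)
	{
		tuples[i].hash = (uint32_t) (i * 2654435761u);	/* scatter the keys */
		insert_tuple(buckets, nbuckets, &tuples[i]);
	}

	/* Probe: read-only lookups, again hitting buckets in random order. */
	for (size_t i = 0; i < ntuples; i++)
		total += probe_bucket(buckets, nbuckets, (uint32_t) (i * 2654435761u));
	printf("%ld matches\n", total);

	free(tuples);
	free(buckets);
	return 0;
}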
I understand that the philosophy is not to provide explicit control over page
size.  That makes sense, but I'd be grateful for any tips on how to encourage
super pages for this use case.

Thanks,

Thomas Munro

====

How to see this (assuming you have git, gmake, flex, bison, readline, curl,
ccache):

# grab postgres
git clone https://github.com/postgres/postgres.git
cd postgres

# you might want to apply the attached patch to get aligned segments
patch -p1 < super-aligned.patch

# build
./configure --prefix=$HOME/install --with-includes=/usr/local/include --with-libs=/usr/local/lib CC="ccache cc"
gmake -s -j8
gmake -s install
gmake -C contrib/pg_prewarm install

# create a db cluster
~/install/bin/initdb -D ~/pgdata
echo "shared_buffers = '1GB'" >> ~/pgdata/postgresql.conf

# you can either start (and later stop) postgres in the background with pg_ctl:
~/install/bin/pg_ctl start -D ~/pgdata
# ... or just run it in the foreground and hit ^C to stop it:
# ~/install/bin/postgres -D ~/pgdata

# run the psql shell
~/install/bin/psql postgres

# inside psql, find your backend's pid
# (you can also find the parallel workers with top, but they come and go with each query)
select pg_backend_pid();

# create a table and set memory size to avoid more complicated batching behaviour
create table t as select generate_series(1, 8000000)::int i;
analyze t;
set work_mem = '1GB';

# if for some reason you want to change the number of parallel workers, try:
# set max_parallel_workers_per_gather = 2;

# this is quite handy for removing all disk IO from the picture
create extension pg_prewarm;
select pg_prewarm('t'::regclass);

# run a toy parallel hash join
explain analyze select count(*) from t t1 join t t2 using (i);

In procstat -v you should see that it spends about half its time "building",
which looks like slowly adding new mappings and touching more and more pages,
and then about half of its time "probing", where there are no further changes
visible in procstat -v.  If your results are like mine, only after building
will you see any S mappings appear, and then only rarely.

[Attachment: super-aligned.patch]

diff --git a/src/backend/storage/ipc/dsm_impl.c b/src/backend/storage/ipc/dsm_impl.c
index 70f899e765..4af16381e9 100644
--- a/src/backend/storage/ipc/dsm_impl.c
+++ b/src/backend/storage/ipc/dsm_impl.c
@@ -377,6 +377,9 @@ dsm_impl_posix(dsm_op op, dsm_handle handle, Size request_size,
 
 	/* Map it. */
 	address = mmap(NULL, request_size, PROT_READ | PROT_WRITE,
+#ifdef MAP_ALIGNED_SUPER
+				   MAP_ALIGNED_SUPER |
+#endif
 				   MAP_SHARED | MAP_HASSEMAPHORE | MAP_NOSYNC, fd, 0);
 	if (address == MAP_FAILED)
 	{
diff --git a/src/backend/utils/mmgr/dsa.c b/src/backend/utils/mmgr/dsa.c
index 3b7e2ad126..ae5d3be290 100644
--- a/src/backend/utils/mmgr/dsa.c
+++ b/src/backend/utils/mmgr/dsa.c
@@ -63,9 +63,10 @@
  * The size of the initial DSM segment that backs a dsa_area created by
  * dsa_create.  After creating some number of segments of this size we'll
  * double this size, and so on.  Larger segments may be created if necessary
- * to satisfy large requests.
+ * to satisfy large requests.  2MB is chosen to match the size of huge pages
+ * on common hardware (though not many OSes currently try to use them).
  */
-#define DSA_INITIAL_SEGMENT_SIZE ((size_t) (1 * 1024 * 1024))
+#define DSA_INITIAL_SEGMENT_SIZE ((size_t) (2 * 1024 * 1024))
 
 /*
  * How many segments to create before we double the segment size.  If this is
@@ -2090,9 +2091,8 @@ make_new_segment(dsa_area *area, size_t requested_pages)
 	 * we wouldn't need to use FreePageManager).
 	 *
 	 * We decide on a total segment size first, so that we produce tidy
-	 * power-of-two sized segments.  This is a good property to have if we
-	 * move to huge pages in the future.  Then we work back to the number of
-	 * pages we can fit.
+	 * power-of-two sized segments to align with operating system pages.
+	 * Then we work back to the number of pages we can fit.
 	 */
 	total_size = DSA_INITIAL_SEGMENT_SIZE *
 		((size_t) 1 << (new_index / DSA_NUM_SEGMENTS_AT_EACH_SIZE));