From owner-freebsd-fs@freebsd.org Mon Aug 27 13:47:51 2018
Date: Mon, 27 Aug 2018 15:47:27 +0200
From: Yamagi Burmeister <lists@yamagi.org>
To: freebsd-fs@freebsd.org
Cc: MacCrest@gmail.com
Subject: ZFS (ARC) performance regression in r321610
Message-Id: <20180827154727.80f92fff9bbc931b37928d43@yamagi.org>

Hi,

after upgrading our database server from FreeBSD 11.1 to 11.2 we're
seeing a major performance regression. On FreeBSD 11.1 everything was
fine; both the master and the replication slaves were able to cope
with the workload. On FreeBSD 11.2 the master is rather slow and the
single-threaded replication slaves are falling behind. I managed to
track this down to ZFS; the ARC in particular seems to be the culprit.

The hardware:
* 2x Intel Xeon E5-2643v3 with latest microcode.
* 128GB RAM
* 8x Intel DC S3700 800GB SSD connected to one LSI / Avago / Broadcom
  SAS9300-8i HBA running the latest firmware (version 16.00.01.00)

The software:
* FreeBSD/amd64 11.1 and 11.2 with GENERIC kernel
* ZFS configured as RAIDZ1 over 6 SSDs, one SSD as spare.
* The system itself is running on UFS.
* MySQL 5.1 (don't ask, the application needs such an old version)

Configuration:
* vfs.zfs.arc_max="16G" (the memory is better used by MySQL)
* vfs.zfs.compressed_arc_enabled="0"
* lz4 compression is enabled on the dataset holding the MySQL data.
* Power management (C-states, powerd) is turned off.
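For reference, a minimal sketch of how this configuration can be
expressed, assuming the tunables are set at boot via /boot/loader.conf;
the dataset name tank/mysql is just a placeholder:

# /boot/loader.conf
vfs.zfs.arc_max="16G"
vfs.zfs.compressed_arc_enabled="0"

# compression on the dataset holding the MySQL data
zfs set compression=lz4 tank/mysql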
The workload is somewhat special. There are hundreds of tables in the
database, but >95% of all load is on only one ~18GB MyISAM table. As
long as _only_ this big table gets all the load, everything is okay.
As soon as MySQL iterates over another big table (a "select *", for
example) the performance degrades and never recovers.

While analyzing the problem I observed that reading through another
table seems to "poison" the ARC. Although the zfs-stats script insists
on a >99% hit rate, the ARC's performance decreases by about 30% as
soon as it has been "poisoned". And even stranger: a very small change
to the ARC's configuration recovers the performance. For example, it's
enough to lower or increase vfs.zfs.arc_max by one byte.

I was able to come up with a test case. The C code can be found inline
or here:

https://gist.github.com/Yamagi/992733b5eed8d1b7e209b72e42d78509

To run this program you'll need a box with a ZFS pool and two files on
it. Both files must be bigger than the maximum ARC size. For example,
if vfs.zfs.arc_max="2G" the files should be at least 2GB, better 4GB
(a sketch for creating suitable files follows the step list below).
All tests should be done after a reboot, to be sure that the ARC is
empty. The program does what's described above, in more detail:

1. As a first step it does 1,000,000 random reads over file A. This
   should (and does) show normal cache behavior: the reads get faster
   as the ARC fills up.
2. As a second step, file B is read from start to end. That "poisons"
   the ARC.
3. The third step again does 1,000,000 random reads over file A. This
   should start slow (because data from file B is in the ARC) and get
   faster over time. But on FreeBSD 11.2 it doesn't get any faster;
   the performance stays degraded, as if the ARC is never hit.
4. The fourth step lowers vfs.zfs.arc_max by one byte.
5. As a final step, another 1,000,000 random reads are done on file A.
   On FreeBSD 11.2 the performance is now good again, just like in
   step 1.
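For illustration, setting up and running the test could look like this
(the pool mountpoint /mnt, file names and sizes are just placeholders):

# Two 4GB files with random contents, assuming vfs.zfs.arc_max="2G":
dd if=/dev/random of=/mnt/file_a bs=1m count=4096
dd if=/dev/random of=/mnt/file_b bs=1m count=4096

# Build and run the test program:
clang -g -Wall -O2 -o zfsarc zfsarc.c
./zfsarc /mnt/file_a /mnt/file_b

Note that the program needs root privileges, since step 4 writes
vfs.zfs.arc_max through sysctl.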
With this program I was able to bisect the source and identify commit
r321610 (MFV r318946: 8021 ARC buf data scatter-ization) as the
culprit:

https://svnweb.freebsd.org/base?view=revision&revision=321610

Of course, the big question is: is this really a regression, or have I
just found the one workload where a generally good and useful change
is a pessimization? And why is a small change to vfs.zfs.arc_max
enough to restore performance? Anyway, running this every 15 minutes
through cron just to keep our database servers at good performance
seems very, very wrong:

#!/bin/sh

CURRENT=$(sysctl -n vfs.zfs.arc_max)
NORMAL="17179869184"

if [ $CURRENT -eq $NORMAL ] ; then
	sysctl vfs.zfs.arc_max=$(($NORMAL-1)) >/dev/null 2>&1
else
	sysctl vfs.zfs.arc_max=$NORMAL >/dev/null 2>&1
fi

Regards,
Yamagi

--------

These measurements were taken on a cheap desktop box with an Intel
Core i5-4430, 24GB RAM and a Samsung 840 EVO 120GB SSD. The effect is
not as pronounced as on the much faster production servers, but it is
still easily observable.

r321609 (good): The performance hit in step 3 is about 12%. In step 5
the performance is fully restored.

1. Doing 1000000 random reads from /mnt/file_a:
 - 100000: 1960 Reads per second
 - 200000: 2020 Reads per second
 - 300000: 2040 Reads per second
 - 400000: 2040 Reads per second
 - 500000: 2049 Reads per second
 - 600000: 2054 Reads per second
 - 700000: 2058 Reads per second
 - 800000: 2056 Reads per second
 - 900000: 2059 Reads per second
 - 1000000: 2061 Reads per second
2. Reading /mnt/file_b from start to end.
3. Doing 1000000 random reads from /mnt/file_a:
 - 100000: 1724 Reads per second
 - 200000: 1739 Reads per second
 - 300000: 1754 Reads per second
 - 400000: 1754 Reads per second
 - 500000: 1760 Reads per second
 - 600000: 1764 Reads per second
 - 700000: 1767 Reads per second
 - 800000: 1769 Reads per second
 - 900000: 1768 Reads per second
 - 1000000: 1769 Reads per second
4. Shrinking the ARC by one byte.
5. Doing 1000000 random reads from /mnt/file_a:
 - 100000: 2040 Reads per second
 - 200000: 2061 Reads per second
 - 300000: 2054 Reads per second
 - 400000: 2061 Reads per second
 - 500000: 2066 Reads per second
 - 600000: 2061 Reads per second
 - 700000: 2064 Reads per second
 - 800000: 2061 Reads per second
 - 900000: 2064 Reads per second
 - 1000000: 2066 Reads per second

r321610 (bad): Reads are generally much slower than on r321609 (good).
The performance hit in step 3 is about 27%. In step 5 the performance
is fully restored.

1. Doing 1000000 random reads from /mnt/file_a:
 - 100000: 1754 Reads per second
 - 200000: 1785 Reads per second
 - 300000: 1807 Reads per second
 - 400000: 1809 Reads per second
 - 500000: 1818 Reads per second
 - 600000: 1818 Reads per second
 - 700000: 1822 Reads per second
 - 800000: 1826 Reads per second
 - 900000: 1825 Reads per second
 - 1000000: 1828 Reads per second
2. Reading /mnt/file_b from start to end.
3. Doing 1000000 random reads from /mnt/file_a:
 - 100000: 1265 Reads per second
 - 200000: 1290 Reads per second
 - 300000: 1293 Reads per second
 - 400000: 1298 Reads per second
 - 500000: 1298 Reads per second
 - 600000: 1298 Reads per second
 - 700000: 1301 Reads per second
 - 800000: 1300 Reads per second
 - 900000: 1302 Reads per second
 - 1000000: 1302 Reads per second
4. Shrinking the ARC by one byte.
5. Doing 1000000 random reads from /mnt/file_a:
 - 100000: 1694 Reads per second
 - 200000: 1694 Reads per second
 - 300000: 1704 Reads per second
 - 400000: 1709 Reads per second
 - 500000: 1706 Reads per second
 - 600000: 1709 Reads per second
 - 700000: 1707 Reads per second
 - 800000: 1709 Reads per second
 - 900000: 1707 Reads per second
 - 1000000: 1709 Reads per second

--------

// clang -g -Wall -O2 -o zfsarc zfsarc.c

#include <sys/types.h>
#include <sys/stat.h>
#include <sys/sysctl.h>
#include <sys/time.h>

#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define ITERATIONS 1000000

void randreads(const char *file, struct stat *sb)
{
	FILE *fd;
	char buffer[256 * 1024];
	struct timeval start;

	gettimeofday(&start, NULL);

	if ((fd = fopen(file, "r")) == NULL)
	{
		fprintf(stderr, "Couldn't open %s: %s\n", file, strerror(errno));
		exit(1);
	}

	for (int i = 1; i <= ITERATIONS; i++)
	{
		uint32_t bytestoread = arc4random_uniform(sizeof(buffer));
		uint32_t position = arc4random_uniform(sb->st_size - sizeof(buffer));

		bytestoread = bytestoread ? bytestoread : 1;

		if (fseek(fd, position, SEEK_SET) != 0)
		{
			fprintf(stderr, "Couldn't seek to %i in %s: %s\n", position, file, strerror(errno));
			fclose(fd);
			exit(1);
		}

		if (fread(buffer, bytestoread, 1, fd) != 1)
		{
			fprintf(stderr, "Couldn't read %i bytes from %s: %s\n", bytestoread, file, strerror(errno));
			fclose(fd);
			exit(1);
		}

		if (!(i % 100000))
		{
			struct timeval now;
			gettimeofday(&now, NULL);

			printf(" - %i: %li Reads per second\n", i, i / (now.tv_sec - start.tv_sec));
		}
	}

	fclose(fd);
}

void fullread(const char *file, struct stat *sb)
{
	FILE *fd;
	char buffer[256 * 1024];
	const uint32_t iterations = sb->st_size / sizeof(buffer);

	if ((fd = fopen(file, "r")) == NULL)
	{
		fprintf(stderr, "Couldn't open %s: %s\n", file, strerror(errno));
		exit(1);
	}

	for (int i = 1; i <= iterations; i++)
	{
		uint32_t bytestoread = (i == iterations) ? sb->st_size % sizeof(buffer) : sizeof(buffer);

		if (bytestoread > 0)
		{
			if (fread(buffer, bytestoread, 1, fd) != 1)
			{
				fprintf(stderr, "Couldn't read %i bytes from %s: %s\n", bytestoread, file, strerror(errno));
				fclose(fd);
				exit(1);
			}
		}
	}

	fclose(fd);
}
int main(int argc, char *argv[])
{
	// No stdout buffering.
	setbuf(stdout, NULL);

	// Get arguments.
	if (argc != 3)
	{
		fprintf(stderr, "Usage: ./zfsarc file_a file_b\n");
		exit(1);
	}

	const char *file_a = argv[1];
	const char *file_b = argv[2];

	// Check if the files are large enough.
	// We need at least the size of the ARC.
	uint64_t arcmax;
	size_t arglen = sizeof(arcmax);

	if (sysctlbyname("vfs.zfs.arc_max", &arcmax, &arglen, NULL, 0) != 0)
	{
		fprintf(stderr, "Couldn't get vfs.zfs.arc_max: %s\n", strerror(errno));
		exit(1);
	}

	struct stat sb_a;

	if (stat(file_a, &sb_a) != 0)
	{
		fprintf(stderr, "Couldn't stat %s: %s\n", file_a, strerror(errno));
		exit(1);
	}

	if (!S_ISREG(sb_a.st_mode))
	{
		fprintf(stderr, "%s is not a file.\n", file_a);
		exit(1);
	}

	if (sb_a.st_size < arcmax)
	{
		fprintf(stderr, "%s is too small: Must be at least %lu bytes\n", file_a, arcmax);
		exit(1);
	}

	struct stat sb_b;

	if (stat(file_b, &sb_b) != 0)
	{
		fprintf(stderr, "Couldn't stat %s: %s\n", file_b, strerror(errno));
		exit(1);
	}

	if (!S_ISREG(sb_b.st_mode))
	{
		fprintf(stderr, "%s is not a file.\n", file_b);
		exit(1);
	}

	if (sb_b.st_size < arcmax)
	{
		fprintf(stderr, "%s is too small: Must be at least %lu bytes\n", file_b, arcmax);
		exit(1);
	}

	// Do ITERATIONS random reads on file A. This will fill up the ARC.
	// In theory this should start slow and get faster as the ARC fills
	// up.
	printf("1. Doing %i random reads from %s:\n", ITERATIONS, file_a);
	randreads(file_a, &sb_a);

	// Poison the ARC by reading once through file B. This will write
	// contents from file B into the ARC, replacing contents from file
	// A.
	printf("2. Reading %s from start to end.\n", file_b);
	fullread(file_b, &sb_b);

	// Do ITERATIONS random reads on file A. This should start slow and
	// get faster as contents of file B are replaced by contents of file
	// A. This can be seen in 11.1-RELEASE, but not in 11.2-RELEASE.
	printf("3. Doing %i random reads from %s:\n", ITERATIONS, file_a);
	randreads(file_a, &sb_a);

	// Shrink the ARC by 1 byte to alter its configuration.
	printf("4. Shrinking the ARC by one byte.\n");
	arcmax--;

	if (sysctlbyname("vfs.zfs.arc_max", NULL, NULL, &arcmax, arglen) != 0)
	{
		fprintf(stderr, "Couldn't set vfs.zfs.arc_max: %s\n", strerror(errno));
		exit(1);
	}

	// Do ITERATIONS random reads on file A. In 11.2-RELEASE this will
	// show good performance again.
	printf("5. Doing %i random reads from %s:\n", ITERATIONS, file_a);
	randreads(file_a, &sb_a);

	return 0;
}

-- 
Homepage: https://www.yamagi.org
Github: https://github.com/yamagi
GPG: 0x1D502515