From owner-freebsd-questions@freebsd.org Wed Jun 8 08:15:21 2016
From: Ciprian Dorin Craciun
Date: Wed, 8 Jun 2016 11:14:40 +0300
Subject: Feedback on UFS2 tuning for large number of small files (~100m)
To: freebsd-questions@freebsd.org

Hello all!

(Please keep me in CC, as I'm not subscribed to the mailing list. Should I perhaps post this to the `freebsd-fs` mailing list instead?)

I would like your feedback on tuning a UFS2 file-system for the following use-case, which is very similar to a maildir mail server. I have looked for hints on the internet, but found nothing more in-depth than enabling soft-updates, `noatime`, etc.

The main usage of the file-system is:

* there are 4 separate file stores, each with about 50 million files, all on the same partition;
* all of the 4 file stores have a dispersed layout on two levels (i.e.
`XX/YY/ZZ...`, where `ZZ...` is a 64-character hexadecimal string); (as a consequence there shouldn't be more than about one thousand files per leaf folder;)
* all of the files are around 2-3 KiB;
* the files are read-mostly, and they are never deleted;
* there is almost no access contention, either for reads or for writes;
* there are 4 matching "queue" stores, dispersed on a single level, containing symlinks;
* each symlink points to a path roughly 100-200 characters in length;
* I wouldn't expect more than a few thousand entries in each queue store;
* the symlinks are constantly `rename`-d into and out of these folders;
* these folders are constantly listed, by 4-32 parallel processes (not multi-threaded);
* (basically I use these stores to emulate a queuing system; I'm careful that each process tries the leaf folders in random order, to reduce contention, and pauses if the queue "seems" empty;)

As side notes:

* the partition is backed by two mirrored disks (which I assume are rotating SCSI disks);
* durability in case of power or system failure (i.e. files getting truncated or going missing) is not critical for my use-case;
* however, file-system consistency on failure (i.e. getting back a correct, mountable file-system) is important; thus, from what I've read in the `mount` man-page, `async` is not an option;
* the system has plenty of RAM (32 GiB), but it is constantly under 100% CPU load from processes at nice level 10;
* the system is dedicated to the task at hand, so there is no other background contention.

The problem that prompted me to ask the community for feedback is that under load (i.e. 100% CPU usage by processes at nice level 10), even listing a folder on the file-system seems to stall, taking anywhere from a fraction of a second up to a few seconds.
The output of `iostat -w 30 -d -C -x -I` under load is (the values are cumulative over each 30-second interval, not per-second averages):

~~~~
device       r/i        w/i       kr/i        kw/i  qlen  tsvc_t/i    sb/i  us ni sy in  id
ada0   1243893.0  4988740.0  6447101.5 311428382.5   600  812579.1  8698.9   0  0  0  0 100
ada1   1243889.0  4988824.0  6429851.0 311428550.5   520  766389.6  8437.3
device       r/i        w/i       kr/i        kw/i  qlen  tsvc_t/i    sb/i  us ni sy in  id
ada0       582.0    12510.0     2328.0    152986.5   383    9463.4    28.9   0  3  1  0  96
ada1       587.0    12465.0     2348.0    152806.5   343    9107.8    28.7
device       r/i        w/i       kr/i        kw/i  qlen  tsvc_t/i    sb/i  us ni sy in  id
ada0       792.0    12933.0     3168.0    157643.5   542   11178.8    29.1   0  3  1  0  96
ada1       791.0    12893.0     3164.0    157651.5   544   10591.2    28.5
~~~~

The file-system is mounted with the following options:

~~~~
ufs rw,noatime
~~~~

The `dumpfs` output for the file-system is:

~~~~
magic   19540119 (UFS2) time    Sat Jun  4 05:59:23 2016
superblock location     65536   id      [ 56cb7a3f 33fd7a56 ]
ncg     2897    size    464257019       blocks  449679279
bsize   32768   shift   15      mask    0xffff8000
fsize   4096    shift   12      mask    0xfffff000
frag    8       shift   3       fsbtodb 3
minfree 8%      optim   time    symlinklen 120
maxbsize 32768  maxbpg  4096    maxcontig 4     contigsumsize 4
nbfree  56167793        ndir    265137  nifree  232205846       nffree  9111
bpg     20035   fpg     160280  ipg     80256   unrefs  0
nindir  4096    inopb   128     maxfilesize     2252349704110079
sbsize  4096    cgsize  32768   csaddr  5056    cssize  49152
sblkno  24      cblkno  32      iblkno  40      dblkno  5056
cgrotor 0       fmod    0       ronly   0       clean   0
metaspace 6408  avgfpdir 64     avgfilesize 16384
flags   soft-updates+journal
fsmnt   /some-path
volname         swuid   0       providersize    464257019
~~~~

Thus I would like to ask the community what I can tune (even by re-formatting the file-system) to make it more "responsive"; alternatively, I am open to another file-system type, perhaps one more suited to this use-case.

Thanks,
Ciprian.