From nobody Wed Apr 6 11:15:35 2022 X-Original-To: freebsd-performance@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 31E3D1AA5E3A; Wed, 6 Apr 2022 11:15:38 +0000 (UTC) (envelope-from egoitz@ramattack.net) Received: from cu01208b.smtpx.saremail.com (cu01208b.smtpx.saremail.com [195.16.151.183]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 4KYMPj1Gy7z4cHM; Wed, 6 Apr 2022 11:15:37 +0000 (UTC) (envelope-from egoitz@ramattack.net) Received: from www.saremail.com (unknown [194.30.0.183]) by sieve-smtp-backend01.sarenet.es (Postfix) with ESMTPA id 5FF0860C050; Wed, 6 Apr 2022 13:15:35 +0200 (CEST) List-Id: Performance/tuning List-Archive: https://lists.freebsd.org/archives/freebsd-performance List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-performance@freebsd.org X-BeenThere: freebsd-performance@freebsd.org MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="=_bd91ad01df27068b5d25c9b898f7c3df" Date: Wed, 06 Apr 2022 13:15:35 +0200 From: egoitz@ramattack.net To: freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org, freebsd-performance@freebsd.org Subject: Desperate with 870 QVO and ZFS Message-ID: <4e98275152e23141eae40dbe7ba5571f@ramattack.net> X-Sender: egoitz@ramattack.net User-Agent: Saremail webmail X-Rspamd-Queue-Id: 4KYMPj1Gy7z4cHM X-Spamd-Bar: -- Authentication-Results: mx1.freebsd.org; dkim=none; dmarc=pass (policy=reject) header.from=ramattack.net; spf=pass (mx1.freebsd.org: domain of egoitz@ramattack.net designates 195.16.151.183 as permitted sender) smtp.mailfrom=egoitz@ramattack.net X-Spamd-Result: default: False [-2.67 / 15.00]; ARC_NA(0.00)[]; RCVD_VIA_SMTP_AUTH(0.00)[]; XM_UA_NO_VERSION(0.01)[]; RCPT_COUNT_THREE(0.00)[3]; R_SPF_ALLOW(-0.20)[+ip4:195.16.151.0/24:c]; TO_MATCH_ENVRCPT_ALL(0.00)[]; MIME_GOOD(-0.10)[multipart/mixed,multipart/alternative,text/plain,multipart/related]; HAS_ATTACHMENT(0.00)[]; TO_DN_NONE(0.00)[]; MIME_BASE64_TEXT_BOGUS(1.00)[]; NEURAL_HAM_LONG(-1.00)[-1.000]; RCVD_TLS_LAST(0.00)[]; NEURAL_HAM_SHORT(-0.99)[-0.994]; MIME_BASE64_TEXT(0.10)[]; FROM_NO_DN(0.00)[]; DMARC_POLICY_ALLOW(-0.50)[ramattack.net,reject]; MLMMJ_DEST(0.00)[freebsd-fs,freebsd-hackers,freebsd-performance]; NEURAL_HAM_MEDIUM(-0.99)[-0.990]; FROM_EQ_ENVFROM(0.00)[]; R_DKIM_NA(0.00)[]; MIME_TRACE(0.00)[0:+,1:+,2:+,3:+,4:~,5:~,6:+]; ASN(0.00)[asn:3262, ipnet:195.16.128.0/19, country:ES]; RCVD_COUNT_TWO(0.00)[2]; MID_RHS_MATCH_FROM(0.00)[] X-ThisMailContainsUnwantedMimeParts: N --=_bd91ad01df27068b5d25c9b898f7c3df Content-Type: multipart/alternative; boundary="=_044aea7dd22a9f76a92b49bbdb55310b" --=_044aea7dd22a9f76a92b49bbdb55310b Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset=US-ASCII Good morning, I write this post with the expectation that perhaps someone could help me I am running some mail servers with FreeBSD and ZFS. They use 870 QVO (not EVO or other Samsung SSD disks) disks as storage. They can easily have from 1500 to 2000 concurrent connections. The machines have 128GB of ram and the CPU is almost absolutely idle. The disk IO is normally at 30 or 40% percent at most. The problem I'm facing is that they could be running just fine and suddenly at some peak hour, the IO goes to 60 or 70% and the machine becomes extremely slow. 
ZFS is left at its defaults, except that the sync property is set to disabled. Apart from that, the ARC is limited to 64GB. But even this is extremely odd: the ARC actually in use stays near 20GB, and I have seen that the metadata cache in the ARC is very close to the limit that FreeBSD sets automatically based on the ARC size you configure. It seems that almost all of the ARC is used by the metadata cache. I have seen this effect on all my mail servers with this hardware and software configuration. I attach a zfs-stats output, although it was taken now, while the servers are not as loaded as in the situation described. Let me explain: I run a couple of Cyrus instances on these servers, one as master and one as slave on each server. The situation described above happens when both Cyrus instances become masters, i.e. when two Cyrus instances are serving clients on the same machine. To avoid issues we have now rebalanced, so each server runs one master and one slave; as you know, a slave instance does almost no IO and keeps only a single connection for replication. So the zfs-stats output reflects, let's say, half of the usual load on each server, because each one currently has one master and one slave instance. As said before, when I place two masters on the same server, everything may work fine all day, but then at 11:00 am (for example) the IO goes to 60% and stays there (it does not increase further), as if IO beyond some concrete limit (I'd say around 60%) simply could not be served. I don't really know whether the QVO technology could be the culprit here: they are said to be desktop disks, but on the other hand I get nice performance when copying mailboxes between servers five at a time, and I can flood a gigabit interface doing so, so they do seem to perform. Could anyone please shed some light on this issue? I don't really know what to think. Best regards, --=_044aea7dd22a9f76a92b49bbdb55310b Content-Type: multipart/related; boundary="=_c643a5b1c3a123a0b3bc0c046160d9bb" --=_c643a5b1c3a123a0b3bc0c046160d9bb Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset=UTF-8
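For reference, the setup described above, expressed as configuration, looks roughly like this. It is only a sketch: the pool/dataset name zroot/mail is a placeholder (not taken from this thread), and the arc_max value is simply 64 GiB written out in bytes.

    # /boot/loader.conf -- cap the ZFS ARC at 64 GiB (value in bytes)
    vfs.zfs.arc_max="68719476736"

    # disable synchronous write semantics on the mail dataset (placeholder name)
    zfs set sync=disabled zroot/mail

With sync=disabled, ZFS acknowledges synchronous writes as soon as they are in memory, before they reach stable storage; that trade-off is relevant to the power-failure concern raised later in this thread.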
--=_c643a5b1c3a123a0b3bc0c046160d9bb Content-Transfer-Encoding: base64 Content-ID: Content-Type: image/gif; name=d8974688.gif Content-Disposition: inline; filename=d8974688.gif; size=42 R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7 --=_c643a5b1c3a123a0b3bc0c046160d9bb-- --=_044aea7dd22a9f76a92b49bbdb55310b-- --=_bd91ad01df27068b5d25c9b898f7c3df Content-Transfer-Encoding: base64 Content-Type: text/plain; name=zfs-stats.txt Content-Disposition: attachment; filename=zfs-stats.txt; size=12369 L3RtcC96ZnMtc3RhdHMgLWEKCi0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0t LS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLQpaRlMgU3Vic3lzdGVtIFJlcG9ydAkJ CQlXZWQgQXByICA2IDExOjU4OjE4IDIwMjIKLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0t LS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tCgpTeXN0ZW0gSW5mb3Jt YXRpb246CgoJS2VybmVsIFZlcnNpb246CQkJCTEyMDIwMDAgKG9zcmVsZGF0ZSkKCUhhcmR3YXJl IFBsYXRmb3JtOgkJCWFtZDY0CglQcm9jZXNzb3IgQXJjaGl0ZWN0dXJlOgkJCWFtZDY0CgoJWkZT IFN0b3JhZ2UgcG9vbCBWZXJzaW9uOgkJNTAwMAoJWkZTIEZpbGVzeXN0ZW0gVmVyc2lvbjoJCQk1 CgpGcmVlQlNEIDEyLjItUkVMRUFTRS1wNiByMzY5ODU1IEdFTkVSSUMgMTE6NThBTSAgdXAgMTQ4 IGRheXMsIDIwOjI5LCAxIHVzZXIsIGxvYWQgYXZlcmFnZXM6IDIuMTEsIDIuNDcsIDIuMjYKCi0t LS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0t LS0tLS0tLS0tLS0tLQoKU3lzdGVtIE1lbW9yeToKCgkzLjI0JQk0LjAzCUdpQiBBY3RpdmUsCTcx LjUwJQk4OC45MwlHaUIgSW5hY3QKCTIyLjc4JQkyOC4zMwlHaUIgV2lyZWQsCTAuMDAlCTAJQnl0 ZXMgQ2FjaGUKCTIuMjUlCTIuODAJR2lCIEZyZWUsCTAuMjMlCTI5Mi42MQlNaUIgR2FwCgoJUmVh bCBJbnN0YWxsZWQ6CQkJCTEyOC4wMAlHaUIKCVJlYWwgQXZhaWxhYmxlOgkJCTk5LjcwJQkxMjcu NjIJR2lCCglSZWFsIE1hbmFnZWQ6CQkJOTcuNDclCTEyNC4zOQlHaUIKCglMb2dpY2FsIFRvdGFs OgkJCQkxMjguMDAJR2lCCglMb2dpY2FsIFVzZWQ6CQkJMjguMzMlCTM2LjI2CUdpQgoJTG9naWNh bCBGcmVlOgkJCTcxLjY3JQk5MS43NAlHaUIKCktlcm5lbCBNZW1vcnk6CQkJCQkzLjIxCUdpQgoJ RGF0YToJCQkJOTguODIlCTMuMTcJR2lCCglUZXh0OgkJCQkxLjE4JQkzOC42NQlNaUIKCktlcm5l bCBNZW1vcnkgTWFwOgkJCQkxMjQuMzkJR2lCCglTaXplOgkJCQkyMi4zNCUJMjcuNzkJR2lCCglG cmVlOgkJCQk3Ny42NiUJOTYuNjAJR2lCCgotLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0t LS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0KCkFSQyBTdW1tYXJ5OiAo SEVBTFRIWSkKCU1lbW9yeSBUaHJvdHRsZSBDb3VudDoJCQkwCgpBUkMgTWlzYzoKCURlbGV0ZWQ6 CQkJCTEuNzgJYgoJTXV0ZXggTWlzc2VzOgkJCQkxODcuMjUJbQoJRXZpY3QgU2tpcHM6CQkJCTg3 LjM1CWIKCkFSQyBTaXplOgkJCQkyOS4yNSUJMTguNzIJR2lCCglUYXJnZXQgU2l6ZTogKEFkYXB0 aXZlKQkJMTIuNTAlCTguMDAJR2lCCglNaW4gU2l6ZSAoSGFyZCBMaW1pdCk6CQkxMi41MCUJOC4w MAlHaUIKCU1heCBTaXplIChIaWdoIFdhdGVyKToJCTg6MQk2NC4wMAlHaUIKCURlY29tcHJlc3Nl ZCBEYXRhIFNpemU6CQkJOS41MglHaUIKCUNvbXByZXNzaW9uIEZhY3RvcjoJCQkwLjUxCgpBUkMg U2l6ZSBCcmVha2Rvd246CglSZWNlbnRseSBVc2VkIENhY2hlIFNpemU6CTIuNjclCTUxMi4wMAlN aUIKCUZyZXF1ZW50bHkgVXNlZCBDYWNoZSBTaXplOgk5Ny4zMyUJMTguMjIJR2lCCgpBUkMgSGFz aCBCcmVha2Rvd246CglFbGVtZW50cyBNYXg6CQkJCTMuMTEJbQoJRWxlbWVudHMgQ3VycmVudDoJ CTE5LjIxJQk1OTYuNjIJawoJQ29sbGlzaW9uczoJCQkJMjAzLjY0CW0KCUNoYWluIE1heDoJCQkJ NQoJQ2hhaW5zOgkJCQkJMTAuMjMJawoKLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0t LS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tCgpBUkMgRWZmaWNpZW5jeToJ CQkJCTE4OC4wNgliCglDYWNoZSBIaXQgUmF0aW86CQk5OC42MyUJMTg1LjQ4CWIKCUNhY2hlIE1p c3MgUmF0aW86CQkxLjM3JQkyLjU4CWIKCUFjdHVhbCBIaXQgUmF0aW86CQk5OC42MiUJMTg1LjQ2 CWIKCglEYXRhIERlbWFuZCBFZmZpY2llbmN5OgkJOTguNDQlCTU3LjM2CWIKCURhdGEgUHJlZmV0 Y2ggRWZmaWNpZW5jeToJMi40NSUJODU5LjAzCW0KCglDQUNIRSBISVRTIEJZIENBQ0hFIExJU1Q6 CgkgIE1vc3QgUmVjZW50bHkgVXNlZDoJCTEwLjg5JQkyMC4yMAliCgkgIE1vc3QgRnJlcXVlbnRs eSBVc2VkOgkJODkuMTAlCTE2NS4yNgliCgkgIE1vc3QgUmVjZW50bHkgVXNlZCBHaG9zdDoJMC4w 
NyUJMTM4LjcwCW0KCSAgTW9zdCBGcmVxdWVudGx5IFVzZWQgR2hvc3Q6CTAuMTYlCTMwMC44NAlt CgoJQ0FDSEUgSElUUyBCWSBEQVRBIFRZUEU6CgkgIERlbWFuZCBEYXRhOgkJCTMwLjQ0JQk1Ni40 NgliCgkgIFByZWZldGNoIERhdGE6CQkwLjAxJQkyMS4wMQltCgkgIERlbWFuZCBNZXRhZGF0YToJ CTY5LjU0JQkxMjguOTgJYgoJICBQcmVmZXRjaCBNZXRhZGF0YToJCTAuMDElCTE2LjA5CW0KCglD QUNIRSBNSVNTRVMgQlkgREFUQSBUWVBFOgoJICBEZW1hbmQgRGF0YToJCQkzNC43NSUJODk2LjY3 CW0KCSAgUHJlZmV0Y2ggRGF0YToJCTMyLjQ3JQk4MzguMDIJbQoJICBEZW1hbmQgTWV0YWRhdGE6 CQkzMS4xOCUJODA0LjU3CW0KCSAgUHJlZmV0Y2ggTWV0YWRhdGE6CQkxLjYwJQk0MS4yNwltCgot LS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0t LS0tLS0tLS0tLS0tLS0KCkwyQVJDIGlzIGRpc2FibGVkCgotLS0tLS0tLS0tLS0tLS0tLS0tLS0t LS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0KClBlciBk YXRhc2V0IHN0YXRpc3RpY3MgYXJlIG5vdCBhdmFpbGFibGUKCi0tLS0tLS0tLS0tLS0tLS0tLS0t LS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLQoKRmls ZS1MZXZlbCBQcmVmZXRjaDoKCkRNVSBFZmZpY2llbmN5OgkJCQkJODMuNTIJYgoJSGl0IFJhdGlv OgkJCTIuNzQlCTIuMjkJYgoJTWlzcyBSYXRpbzoJCQk5Ny4yNiUJODEuMjMJYgoKLS0tLS0tLS0t LS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0t LS0tLS0tCgpWREVWIGNhY2hlIGlzIGRpc2FibGVkCgotLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0t LS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0KClpGUyBUdW5h YmxlcyAoc3lzY3RsKToKCWtlcm4ubWF4dXNlcnMgICAgICAgICAgICAgICAgICAgICAgICAgICA4 NTAzCgl2bS5rbWVtX3NpemUgICAgICAgICAgICAgICAgICAgICAgICAgICAgMTMzNTYwMzA3NzEy Cgl2bS5rbWVtX3NpemVfc2NhbGUgICAgICAgICAgICAgICAgICAgICAgMQoJdm0ua21lbV9zaXpl X21pbiAgICAgICAgICAgICAgICAgICAgICAgIDAKCXZtLmttZW1fc2l6ZV9tYXggICAgICAgICAg ICAgICAgICAgICAgICAxMzE5NDEzOTUwODc0Cgl2ZnMuemZzLnRyaW0ubWF4X2ludGVydmFsICAg ICAgICAgICAgICAgMQoJdmZzLnpmcy50cmltLnRpbWVvdXQgICAgICAgICAgICAgICAgICAgIDMw Cgl2ZnMuemZzLnRyaW0udHhnX2RlbGF5ICAgICAgICAgICAgICAgICAgMzIKCXZmcy56ZnMudHJp bS5lbmFibGVkICAgICAgICAgICAgICAgICAgICAxCgl2ZnMuemZzLnZvbC5pbW1lZGlhdGVfd3Jp dGVfc3ogICAgICAgICAgMzI3NjgKCXZmcy56ZnMudm9sLnVubWFwX3N5bmNfZW5hYmxlZCAgICAg ICAgICAwCgl2ZnMuemZzLnZvbC51bm1hcF9lbmFibGVkICAgICAgICAgICAgICAgMQoJdmZzLnpm cy52b2wucmVjdXJzaXZlICAgICAgICAgICAgICAgICAgIDAKCXZmcy56ZnMudm9sLm1vZGUgICAg ICAgICAgICAgICAgICAgICAgICAxCgl2ZnMuemZzLnZlcnNpb24uenBsICAgICAgICAgICAgICAg ICAgICAgNQoJdmZzLnpmcy52ZXJzaW9uLnNwYSAgICAgICAgICAgICAgICAgICAgIDUwMDAKCXZm cy56ZnMudmVyc2lvbi5hY2wgICAgICAgICAgICAgICAgICAgICAxCgl2ZnMuemZzLnZlcnNpb24u aW9jdGwgICAgICAgICAgICAgICAgICAgNwoJdmZzLnpmcy5kZWJ1ZyAgICAgICAgICAgICAgICAg ICAgICAgICAgIDAKCXZmcy56ZnMuc3VwZXJfb3duZXIgICAgICAgICAgICAgICAgICAgICAwCgl2 ZnMuemZzLmltbWVkaWF0ZV93cml0ZV9zeiAgICAgICAgICAgICAgMzI3NjgKCXZmcy56ZnMuc3lu Y19wYXNzX3Jld3JpdGUgICAgICAgICAgICAgICAyCgl2ZnMuemZzLnN5bmNfcGFzc19kb250X2Nv bXByZXNzICAgICAgICAgNQoJdmZzLnpmcy5zeW5jX3Bhc3NfZGVmZXJyZWRfZnJlZSAgICAgICAg IDIKCXZmcy56ZnMuemlvLmR2YV90aHJvdHRsZV9lbmFibGVkICAgICAgICAxCgl2ZnMuemZzLnpp by5leGNsdWRlX21ldGFkYXRhICAgICAgICAgICAgMAoJdmZzLnpmcy56aW8udXNlX3VtYSAgICAg ICAgICAgICAgICAgICAgIDEKCXZmcy56ZnMuemlvLnRhc2txX2JhdGNoX3BjdCAgICAgICAgICAg ICA3NQoJdmZzLnpmcy56aWxfbWF4YmxvY2tzaXplICAgICAgICAgICAgICAgIDEzMTA3MgoJdmZz Lnpmcy56aWxfc2xvZ19idWxrICAgICAgICAgICAgICAgICAgIDc4NjQzMgoJdmZzLnpmcy56aWxf bm9jYWNoZWZsdXNoICAgICAgICAgICAgICAgIDAKCXZmcy56ZnMuemlsX3JlcGxheV9kaXNhYmxl ICAgICAgICAgICAgICAwCgl2ZnMuemZzLmNhY2hlX2ZsdXNoX2Rpc2FibGUgICAgICAgICAgICAg MAoJdmZzLnpmcy5zdGFuZGFyZF9zbV9ibGtzeiAgICAgICAgICAgICAgIDEzMTA3MgoJdmZzLnpm cy5kdGxfc21fYmxrc3ogICAgICAgICAgICAgICAgICAgIDQwOTYKCXZmcy56ZnMubWluX2F1dG9f YXNoaWZ0ICAgICAgICAgICAgICAgICAxMgoJdmZzLnpmcy5tYXhfYXV0b19hc2hpZnQgICAgICAg 
ICAgICAgICAgIDEzCgl2ZnMuemZzLnZkZXYudHJpbV9tYXhfcGVuZGluZyAgICAgICAgICAgMTAw MDAKCXZmcy56ZnMudmRldi5iaW9fZGVsZXRlX2Rpc2FibGUgICAgICAgICAwCgl2ZnMuemZzLnZk ZXYuYmlvX2ZsdXNoX2Rpc2FibGUgICAgICAgICAgMAoJdmZzLnpmcy52ZGV2LmRlZl9xdWV1ZV9k ZXB0aCAgICAgICAgICAgIDMyCgl2ZnMuemZzLnZkZXYucXVldWVfZGVwdGhfcGN0ICAgICAgICAg ICAgMTAwMAoJdmZzLnpmcy52ZGV2LndyaXRlX2dhcF9saW1pdCAgICAgICAgICAgIDQwOTYKCXZm cy56ZnMudmRldi5yZWFkX2dhcF9saW1pdCAgICAgICAgICAgICAzMjc2OAoJdmZzLnpmcy52ZGV2 LmFnZ3JlZ2F0aW9uX2xpbWl0X25vbl9yb3RhdGluZzEzMTA3MgoJdmZzLnpmcy52ZGV2LmFnZ3Jl Z2F0aW9uX2xpbWl0ICAgICAgICAgIDEwNDg1NzYKCXZmcy56ZnMudmRldi5pbml0aWFsaXppbmdf bWF4X2FjdGl2ZSAgICAxCgl2ZnMuemZzLnZkZXYuaW5pdGlhbGl6aW5nX21pbl9hY3RpdmUgICAg MQoJdmZzLnpmcy52ZGV2LnJlbW92YWxfbWF4X2FjdGl2ZSAgICAgICAgIDIKCXZmcy56ZnMudmRl di5yZW1vdmFsX21pbl9hY3RpdmUgICAgICAgICAxCgl2ZnMuemZzLnZkZXYudHJpbV9tYXhfYWN0 aXZlICAgICAgICAgICAgNjQKCXZmcy56ZnMudmRldi50cmltX21pbl9hY3RpdmUgICAgICAgICAg ICAxCgl2ZnMuemZzLnZkZXYuc2NydWJfbWF4X2FjdGl2ZSAgICAgICAgICAgMgoJdmZzLnpmcy52 ZGV2LnNjcnViX21pbl9hY3RpdmUgICAgICAgICAgIDEKCXZmcy56ZnMudmRldi5hc3luY193cml0 ZV9tYXhfYWN0aXZlICAgICAxMAoJdmZzLnpmcy52ZGV2LmFzeW5jX3dyaXRlX21pbl9hY3RpdmUg ICAgIDEKCXZmcy56ZnMudmRldi5hc3luY19yZWFkX21heF9hY3RpdmUgICAgICAzCgl2ZnMuemZz LnZkZXYuYXN5bmNfcmVhZF9taW5fYWN0aXZlICAgICAgMQoJdmZzLnpmcy52ZGV2LnN5bmNfd3Jp dGVfbWF4X2FjdGl2ZSAgICAgIDEwCgl2ZnMuemZzLnZkZXYuc3luY193cml0ZV9taW5fYWN0aXZl ICAgICAgMTAKCXZmcy56ZnMudmRldi5zeW5jX3JlYWRfbWF4X2FjdGl2ZSAgICAgICAxMAoJdmZz Lnpmcy52ZGV2LnN5bmNfcmVhZF9taW5fYWN0aXZlICAgICAgIDEwCgl2ZnMuemZzLnZkZXYubWF4 X2FjdGl2ZSAgICAgICAgICAgICAgICAgMTAwMAoJdmZzLnpmcy52ZGV2LmFzeW5jX3dyaXRlX2Fj dGl2ZV9tYXhfZGlydHlfcGVyY2VudDYwCgl2ZnMuemZzLnZkZXYuYXN5bmNfd3JpdGVfYWN0aXZl X21pbl9kaXJ0eV9wZXJjZW50MzAKCXZmcy56ZnMudmRldi5taXJyb3Iubm9uX3JvdGF0aW5nX3Nl ZWtfaW5jMQoJdmZzLnpmcy52ZGV2Lm1pcnJvci5ub25fcm90YXRpbmdfaW5jICAgIDAKCXZmcy56 ZnMudmRldi5taXJyb3Iucm90YXRpbmdfc2Vla19vZmZzZXQxMDQ4NTc2Cgl2ZnMuemZzLnZkZXYu bWlycm9yLnJvdGF0aW5nX3NlZWtfaW5jICAgNQoJdmZzLnpmcy52ZGV2Lm1pcnJvci5yb3RhdGlu Z19pbmMgICAgICAgIDAKCXZmcy56ZnMudmRldi50cmltX29uX2luaXQgICAgICAgICAgICAgICAx Cgl2ZnMuemZzLnZkZXYuY2FjaGUuYnNoaWZ0ICAgICAgICAgICAgICAgMTYKCXZmcy56ZnMudmRl di5jYWNoZS5zaXplICAgICAgICAgICAgICAgICAwCgl2ZnMuemZzLnZkZXYuY2FjaGUubWF4ICAg ICAgICAgICAgICAgICAgMTYzODQKCXZmcy56ZnMudmRldi52YWxpZGF0ZV9za2lwICAgICAgICAg ICAgICAwCgl2ZnMuemZzLnZkZXYubWF4X21zX3NoaWZ0ICAgICAgICAgICAgICAgMzQKCXZmcy56 ZnMudmRldi5kZWZhdWx0X21zX3NoaWZ0ICAgICAgICAgICAyOQoJdmZzLnpmcy52ZGV2Lm1heF9t c19jb3VudF9saW1pdCAgICAgICAgIDEzMTA3MgoJdmZzLnpmcy52ZGV2Lm1pbl9tc19jb3VudCAg ICAgICAgICAgICAgIDE2Cgl2ZnMuemZzLnZkZXYuZGVmYXVsdF9tc19jb3VudCAgICAgICAgICAg MjAwCgl2ZnMuemZzLnR4Zy50aW1lb3V0ICAgICAgICAgICAgICAgICAgICAgNQoJdmZzLnpmcy5z cGFjZV9tYXBfaWJzICAgICAgICAgICAgICAgICAgIDE0Cgl2ZnMuemZzLnNwZWNpYWxfY2xhc3Nf bWV0YWRhdGFfcmVzZXJ2ZV9wY3QyNQoJdmZzLnpmcy51c2VyX2luZGlyZWN0X2lzX3NwZWNpYWwg ICAgICAgIDEKCXZmcy56ZnMuZGR0X2RhdGFfaXNfc3BlY2lhbCAgICAgICAgICAgICAxCgl2ZnMu emZzLnNwYV9hbGxvY2F0b3JzICAgICAgICAgICAgICAgICAgNAoJdmZzLnpmcy5zcGFfbWluX3Ns b3AgICAgICAgICAgICAgICAgICAgIDEzNDIxNzcyOAoJdmZzLnpmcy5zcGFfc2xvcF9zaGlmdCAg ICAgICAgICAgICAgICAgIDUKCXZmcy56ZnMuc3BhX2FzaXplX2luZmxhdGlvbiAgICAgICAgICAg ICAyNAoJdmZzLnpmcy5kZWFkbWFuX2VuYWJsZWQgICAgICAgICAgICAgICAgIDEKCXZmcy56ZnMu ZGVhZG1hbl9jaGVja3RpbWVfbXMgICAgICAgICAgICA1MDAwCgl2ZnMuemZzLmRlYWRtYW5fc3lu Y3RpbWVfbXMgICAgICAgICAgICAgMTAwMDAwMAoJdmZzLnpmcy5kZWJ1Z2ZsYWdzICAgICAgICAg ICAgICAgICAgICAgIDAKCXZmcy56ZnMucmVjb3ZlciAgICAgICAgICAgICAgICAgICAgICAgICAw Cgl2ZnMuemZzLnNwYV9sb2FkX3ZlcmlmeV9kYXRhICAgICAgICAgICAgMQoJdmZzLnpmcy5zcGFf 
bG9hZF92ZXJpZnlfbWV0YWRhdGEgICAgICAgIDEKCXZmcy56ZnMuc3BhX2xvYWRfdmVyaWZ5X21h eGluZmxpZ2h0ICAgICAxMDAwMAoJdmZzLnpmcy5tYXhfbWlzc2luZ190dmRzX3NjYW4gICAgICAg ICAgIDAKCXZmcy56ZnMubWF4X21pc3NpbmdfdHZkc19jYWNoZWZpbGUgICAgICAyCgl2ZnMuemZz Lm1heF9taXNzaW5nX3R2ZHMgICAgICAgICAgICAgICAgMAoJdmZzLnpmcy5zcGFfbG9hZF9wcmlu dF92ZGV2X3RyZWUgICAgICAgIDAKCXZmcy56ZnMuY2N3X3JldHJ5X2ludGVydmFsICAgICAgICAg ICAgICAzMDAKCXZmcy56ZnMuY2hlY2tfaG9zdGlkICAgICAgICAgICAgICAgICAgICAxCgl2ZnMu emZzLm11bHRpaG9zdF9mYWlsX2ludGVydmFscyAgICAgICAgMTAKCXZmcy56ZnMubXVsdGlob3N0 X2ltcG9ydF9pbnRlcnZhbHMgICAgICAyMAoJdmZzLnpmcy5tdWx0aWhvc3RfaW50ZXJ2YWwgICAg ICAgICAgICAgIDEwMDAKCXZmcy56ZnMubWdfZnJhZ21lbnRhdGlvbl90aHJlc2hvbGQgICAgICA4 NQoJdmZzLnpmcy5tZ19ub2FsbG9jX3RocmVzaG9sZCAgICAgICAgICAgIDAKCXZmcy56ZnMuY29u ZGVuc2VfcGN0ICAgICAgICAgICAgICAgICAgICAyMDAKCXZmcy56ZnMubWV0YXNsYWJfc21fYmxr c3ogICAgICAgICAgICAgICA0MDk2Cgl2ZnMuemZzLm1ldGFzbGFiLmJpYXNfZW5hYmxlZCAgICAg ICAgICAgMQoJdmZzLnpmcy5tZXRhc2xhYi5sYmFfd2VpZ2h0aW5nX2VuYWJsZWQgIDEKCXZmcy56 ZnMubWV0YXNsYWIuZnJhZ21lbnRhdGlvbl9mYWN0b3JfZW5hYmxlZDEKCXZmcy56ZnMubWV0YXNs YWIucHJlbG9hZF9lbmFibGVkICAgICAgICAxCgl2ZnMuemZzLm1ldGFzbGFiLnByZWxvYWRfbGlt aXQgICAgICAgICAgMwoJdmZzLnpmcy5tZXRhc2xhYi51bmxvYWRfZGVsYXkgICAgICAgICAgIDgK CXZmcy56ZnMubWV0YXNsYWIubG9hZF9wY3QgICAgICAgICAgICAgICA1MAoJdmZzLnpmcy5tZXRh c2xhYi5taW5fYWxsb2Nfc2l6ZSAgICAgICAgIDMzNTU0NDMyCgl2ZnMuemZzLm1ldGFzbGFiLmRm X2ZyZWVfcGN0ICAgICAgICAgICAgNAoJdmZzLnpmcy5tZXRhc2xhYi5kZl9hbGxvY190aHJlc2hv bGQgICAgIDEzMTA3MgoJdmZzLnpmcy5tZXRhc2xhYi5kZWJ1Z191bmxvYWQgICAgICAgICAgIDAK CXZmcy56ZnMubWV0YXNsYWIuZGVidWdfbG9hZCAgICAgICAgICAgICAwCgl2ZnMuemZzLm1ldGFz bGFiLmZyYWdtZW50YXRpb25fdGhyZXNob2xkNzAKCXZmcy56ZnMubWV0YXNsYWIuZm9yY2VfZ2Fu Z2luZyAgICAgICAgICAxNjc3NzIxNwoJdmZzLnpmcy5mcmVlX2Jwb2JqX2VuYWJsZWQgICAgICAg ICAgICAgIDEKCXZmcy56ZnMuZnJlZV9tYXhfYmxvY2tzICAgICAgICAgICAgICAgICAtMQoJdmZz Lnpmcy56ZnNfc2Nhbl9jaGVja3BvaW50X2ludGVydmFsICAgIDcyMDAKCXZmcy56ZnMuemZzX3Nj YW5fbGVnYWN5ICAgICAgICAgICAgICAgICAwCgl2ZnMuemZzLm5vX3NjcnViX3ByZWZldGNoICAg ICAgICAgICAgICAgMAoJdmZzLnpmcy5ub19zY3J1Yl9pbyAgICAgICAgICAgICAgICAgICAgIDAK CXZmcy56ZnMucmVzaWx2ZXJfbWluX3RpbWVfbXMgICAgICAgICAgICAzMDAwCgl2ZnMuemZzLmZy ZWVfbWluX3RpbWVfbXMgICAgICAgICAgICAgICAgMTAwMAoJdmZzLnpmcy5zY2FuX21pbl90aW1l X21zICAgICAgICAgICAgICAgIDEwMDAKCXZmcy56ZnMuc2Nhbl9pZGxlICAgICAgICAgICAgICAg ICAgICAgICA1MAoJdmZzLnpmcy5zY3J1Yl9kZWxheSAgICAgICAgICAgICAgICAgICAgIDQKCXZm cy56ZnMucmVzaWx2ZXJfZGVsYXkgICAgICAgICAgICAgICAgICAyCgl2ZnMuemZzLnpmZXRjaC5h cnJheV9yZF9zeiAgICAgICAgICAgICAgMTA0ODU3NgoJdmZzLnpmcy56ZmV0Y2gubWF4X2lkaXN0 YW5jZSAgICAgICAgICAgIDY3MTA4ODY0Cgl2ZnMuemZzLnpmZXRjaC5tYXhfZGlzdGFuY2UgICAg ICAgICAgICAgODM4ODYwOAoJdmZzLnpmcy56ZmV0Y2gubWluX3NlY19yZWFwICAgICAgICAgICAg IDIKCXZmcy56ZnMuemZldGNoLm1heF9zdHJlYW1zICAgICAgICAgICAgICA4Cgl2ZnMuemZzLnBy ZWZldGNoX2Rpc2FibGUgICAgICAgICAgICAgICAgMAoJdmZzLnpmcy5kZWxheV9zY2FsZSAgICAg ICAgICAgICAgICAgICAgIDUwMDAwMAoJdmZzLnpmcy5kZWxheV9taW5fZGlydHlfcGVyY2VudCAg ICAgICAgIDYwCgl2ZnMuemZzLmRpcnR5X2RhdGFfc3luY19wY3QgICAgICAgICAgICAgMjAKCXZm cy56ZnMuZGlydHlfZGF0YV9tYXhfcGVyY2VudCAgICAgICAgICAxMAoJdmZzLnpmcy5kaXJ0eV9k YXRhX21heF9tYXggICAgICAgICAgICAgIDQyOTQ5NjcyOTYKCXZmcy56ZnMuZGlydHlfZGF0YV9t YXggICAgICAgICAgICAgICAgICA0Mjk0OTY3Mjk2Cgl2ZnMuemZzLm1heF9yZWNvcmRzaXplICAg ICAgICAgICAgICAgICAgMTA0ODU3NgoJdmZzLnpmcy5kZWZhdWx0X2licyAgICAgICAgICAgICAg ICAgICAgIDE3Cgl2ZnMuemZzLmRlZmF1bHRfYnMgICAgICAgICAgICAgICAgICAgICAgOQoJdmZz Lnpmcy5zZW5kX2hvbGVzX3dpdGhvdXRfYmlydGhfdGltZSAgIDEKCXZmcy56ZnMubWRjb21wX2Rp c2FibGUgICAgICAgICAgICAgICAgICAwCgl2ZnMuemZzLnBlcl90eGdfZGlydHlfZnJlZXNfcGVy 
Y2VudCAgICAgNQoJdmZzLnpmcy5ub3B3cml0ZV9lbmFibGVkICAgICAgICAgICAgICAgIDEKCXZm cy56ZnMuZGVkdXAucHJlZmV0Y2ggICAgICAgICAgICAgICAgICAxCgl2ZnMuemZzLmRidWZfY2Fj aGVfbG93YXRlcl9wY3QgICAgICAgICAgMTAKCXZmcy56ZnMuZGJ1Zl9jYWNoZV9oaXdhdGVyX3Bj dCAgICAgICAgICAxMAoJdmZzLnpmcy5kYnVmX21ldGFkYXRhX2NhY2hlX292ZXJmbG93ICAgIDAK CXZmcy56ZnMuZGJ1Zl9tZXRhZGF0YV9jYWNoZV9zaGlmdCAgICAgICA2Cgl2ZnMuemZzLmRidWZf Y2FjaGVfc2hpZnQgICAgICAgICAgICAgICAgNQoJdmZzLnpmcy5kYnVmX21ldGFkYXRhX2NhY2hl X21heF9ieXRlcyAgIDEwNzM3NDE4MjQKCXZmcy56ZnMuZGJ1Zl9jYWNoZV9tYXhfYnl0ZXMgICAg ICAgICAgICAyMTQ3NDgzNjQ4Cgl2ZnMuemZzLmFyY19taW5fcHJlc2NpZW50X3ByZWZldGNoX21z ICAgNgoJdmZzLnpmcy5hcmNfbWluX3ByZWZldGNoX21zICAgICAgICAgICAgIDEKCXZmcy56ZnMu bDJjX29ubHlfc2l6ZSAgICAgICAgICAgICAgICAgICAwCgl2ZnMuemZzLm1mdV9naG9zdF9kYXRh X2VzaXplICAgICAgICAgICAgMTQ2NjE1MzQ3MgoJdmZzLnpmcy5tZnVfZ2hvc3RfbWV0YWRhdGFf ZXNpemUgICAgICAgIDcwNzIzODA0MTYKCXZmcy56ZnMubWZ1X2dob3N0X3NpemUgICAgICAgICAg ICAgICAgICA4NTM4NTMzODg4Cgl2ZnMuemZzLm1mdV9kYXRhX2VzaXplICAgICAgICAgICAgICAg ICAgMAoJdmZzLnpmcy5tZnVfbWV0YWRhdGFfZXNpemUgICAgICAgICAgICAgIDAKCXZmcy56ZnMu bWZ1X3NpemUgICAgICAgICAgICAgICAgICAgICAgICA5ODMxMjcwNDAKCXZmcy56ZnMubXJ1X2do b3N0X2RhdGFfZXNpemUgICAgICAgICAgICAwCgl2ZnMuemZzLm1ydV9naG9zdF9tZXRhZGF0YV9l c2l6ZSAgICAgICAgMAoJdmZzLnpmcy5tcnVfZ2hvc3Rfc2l6ZSAgICAgICAgICAgICAgICAgIDAK CXZmcy56ZnMubXJ1X2RhdGFfZXNpemUgICAgICAgICAgICAgICAgICAwCgl2ZnMuemZzLm1ydV9t ZXRhZGF0YV9lc2l6ZSAgICAgICAgICAgICAgMAoJdmZzLnpmcy5tcnVfc2l6ZSAgICAgICAgICAg ICAgICAgICAgICAgIDEyOTAzMzUxMjk2Cgl2ZnMuemZzLmFub25fZGF0YV9lc2l6ZSAgICAgICAg ICAgICAgICAgMAoJdmZzLnpmcy5hbm9uX21ldGFkYXRhX2VzaXplICAgICAgICAgICAgIDAKCXZm cy56ZnMuYW5vbl9zaXplICAgICAgICAgICAgICAgICAgICAgICAxMjUxMjI1NjAKCXZmcy56ZnMu bDJhcmNfbm9ydyAgICAgICAgICAgICAgICAgICAgICAxCgl2ZnMuemZzLmwyYXJjX2ZlZWRfYWdh aW4gICAgICAgICAgICAgICAgMQoJdmZzLnpmcy5sMmFyY19ub3ByZWZldGNoICAgICAgICAgICAg ICAgIDEKCXZmcy56ZnMubDJhcmNfZmVlZF9taW5fbXMgICAgICAgICAgICAgICAyMDAKCXZmcy56 ZnMubDJhcmNfZmVlZF9zZWNzICAgICAgICAgICAgICAgICAxCgl2ZnMuemZzLmwyYXJjX2hlYWRy b29tICAgICAgICAgICAgICAgICAgMgoJdmZzLnpmcy5sMmFyY193cml0ZV9ib29zdCAgICAgICAg ICAgICAgIDgzODg2MDgKCXZmcy56ZnMubDJhcmNfd3JpdGVfbWF4ICAgICAgICAgICAgICAgICA4 Mzg4NjA4Cgl2ZnMuemZzLmFyY19tZXRhX3N0cmF0ZWd5ICAgICAgICAgICAgICAgMAoJdmZzLnpm cy5hcmNfbWV0YV9saW1pdCAgICAgICAgICAgICAgICAgIDE3MTc5ODY5MTg0Cgl2ZnMuemZzLmFy Y19mcmVlX3RhcmdldCAgICAgICAgICAgICAgICAgNjk0NzI0Cgl2ZnMuemZzLmFyY19rbWVtX2Nh Y2hlX3JlYXBfcmV0cnlfbXMgICAgMTAwMAoJdmZzLnpmcy5jb21wcmVzc2VkX2FyY19lbmFibGVk ICAgICAgICAgIDEKCXZmcy56ZnMuYXJjX2dyb3dfcmV0cnkgICAgICAgICAgICAgICAgICA2MAoJ dmZzLnpmcy5hcmNfc2hyaW5rX3NoaWZ0ICAgICAgICAgICAgICAgIDcKCXZmcy56ZnMuYXJjX2F2 ZXJhZ2VfYmxvY2tzaXplICAgICAgICAgICA4MTkyCgl2ZnMuemZzLmFyY19ub19ncm93X3NoaWZ0 ICAgICAgICAgICAgICAgNQoJdmZzLnpmcy5hcmNfbWluICAgICAgICAgICAgICAgICAgICAgICAg IDg1ODk5MzQ1OTIKCXZmcy56ZnMuYXJjX21heCAgICAgICAgICAgICAgICAgICAgICAgICA2ODcx OTQ3NjczNgoJdmZzLnpmcy5hYmRfY2h1bmtfc2l6ZSAgICAgICAgICAgICAgICAgIDQwOTYKCXZm cy56ZnMuYWJkX3NjYXR0ZXJfZW5hYmxlZCAgICAgICAgICAgICAxCgotLS0tLS0tLS0tLS0tLS0t LS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0tLS0K --=_bd91ad01df27068b5d25c9b898f7c3df-- From nobody Wed Apr 6 11:28:45 2022 X-Original-To: freebsd-performance@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 966A01A81E70; Wed, 6 Apr 2022 11:28:48 +0000 (UTC) (envelope-from egoitz@ramattack.net) Received: from cu1208c.smtpx.saremail.com (cu1208c.smtpx.saremail.com [195.16.148.183]) (using TLSv1.3 with cipher 
TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 4KYMhv4tm0z4gLP; Wed, 6 Apr 2022 11:28:47 +0000 (UTC) (envelope-from egoitz@ramattack.net) Received: from www.saremail.com (unknown [194.30.0.183]) by sieve-smtp-backend02.sarenet.es (Postfix) with ESMTPA id D560C60C055; Wed, 6 Apr 2022 13:28:45 +0200 (CEST) List-Id: Performance/tuning List-Archive: https://lists.freebsd.org/archives/freebsd-performance List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-performance@freebsd.org X-BeenThere: freebsd-performance@freebsd.org MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="=_2094f57367545a643dbb74ca4f8ba24d" Date: Wed, 06 Apr 2022 13:28:45 +0200 From: egoitz@ramattack.net To: freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org, freebsd-performance@freebsd.org Subject: Re: Desperate with 870 QVO and ZFS In-Reply-To: <4e98275152e23141eae40dbe7ba5571f@ramattack.net> References: <4e98275152e23141eae40dbe7ba5571f@ramattack.net> Message-ID: X-Sender: egoitz@ramattack.net User-Agent: Saremail webmail X-Rspamd-Queue-Id: 4KYMhv4tm0z4gLP X-Spamd-Bar: --- Authentication-Results: mx1.freebsd.org; dkim=none; dmarc=pass (policy=reject) header.from=ramattack.net; spf=pass (mx1.freebsd.org: domain of egoitz@ramattack.net designates 195.16.148.183 as permitted sender) smtp.mailfrom=egoitz@ramattack.net X-Spamd-Result: default: False [-3.79 / 15.00]; ARC_NA(0.00)[]; RCVD_VIA_SMTP_AUTH(0.00)[]; XM_UA_NO_VERSION(0.01)[]; RCPT_COUNT_THREE(0.00)[3]; R_SPF_ALLOW(-0.20)[+ip4:195.16.148.0/24:c]; TO_MATCH_ENVRCPT_ALL(0.00)[]; MIME_GOOD(-0.10)[multipart/alternative,text/plain,multipart/related]; TO_DN_NONE(0.00)[]; NEURAL_HAM_LONG(-1.00)[-1.000]; RCVD_TLS_LAST(0.00)[]; NEURAL_HAM_SHORT(-1.00)[-1.000]; DMARC_POLICY_ALLOW(-0.50)[ramattack.net,reject]; FROM_NO_DN(0.00)[]; NEURAL_HAM_MEDIUM(-1.00)[-1.000]; MLMMJ_DEST(0.00)[freebsd-fs,freebsd-hackers,freebsd-performance]; FROM_EQ_ENVFROM(0.00)[]; R_DKIM_NA(0.00)[]; MIME_TRACE(0.00)[0:+,1:+,2:+,3:~,4:~]; ASN(0.00)[asn:3262, ipnet:195.16.128.0/19, country:ES]; RCVD_COUNT_TWO(0.00)[2]; MID_RHS_MATCH_FROM(0.00)[] X-ThisMailContainsUnwantedMimeParts: N --=_2094f57367545a643dbb74ca4f8ba24d Content-Transfer-Encoding: 8bit Content-Type: text/plain; charset=UTF-8 The most extrange thing is... When machine boots ARC is in 40 value of GB used (for instance), but later decreases to 20GB (and this is not an example... is exact) in all my servers.... it's like if the ARC metadata which is more or less 17GB would limite the whole ARC..... With the traffic of this machines, it should I suppose the ARC should be larger than it is... and ARC in loader.conf is limited to 64GB (the half the ram this machines have) El 2022-04-06 13:15, egoitz@ramattack.net escribió: > ATENCION: Este correo se ha enviado desde fuera de la organización. No pinche en los enlaces ni abra los adjuntos a no ser que reconozca el remitente y sepa que el contenido es seguro. > > Good morning, > > I write this post with the expectation that perhaps someone could help me > > I am running some mail servers with FreeBSD and ZFS. They use 870 QVO (not EVO or other Samsung SSD disks) disks as storage. They can easily have from 1500 to 2000 concurrent connections. The machines have 128GB of ram and the CPU is almost absolutely idle. The disk IO is normally at 30 or 40% percent at most. 
> > The problem I'm facing is that they could be running just fine and suddenly at some peak hour, the IO goes to 60 or 70% and the machine becomes extremely slow. ZFS is all by default, except the sync parameter which is set disabled. Apart from that the ARC is limited to 64GB. But even this is extremely odd. The used ARC is near 20GB. I have seen, that meta cache in arc is very near to the limit that FreeBSD automatically sets depending on the size of the ARC you set. It seems that almost all ARC is used by meta cache. I have seen this effect in all my mail servers with this hardware and software config. > > I do attach a zfs-stats output, but from now that the servers are not so loaded as described. I do explain. I run a couple of Cyrus instances in these servers. One as master, one as slave on each server. The commented situation from above, happens when both Cyrus instances become master, so when we are using two Cyrus instances giving service in the same machine. For avoiding issues, know we have balanced and we have a master and a slave in each server. You know, a slave instance has almost no io and only a single connection for replication. So the zfs-stats output is from now we have let's say half of load in each server, because they have one master and one slave instance. > > As said before, when I place two masters in same server, perhaps all day works, but just at 11:00 am (for example) the IO goes to 60% (it doesn't increase) but it seems like if the IO where not being able to be served, let's say more than a limit. More than a concrete io limit (I'd say 60%). > > I don't really know if, perhaps the QVO technology could be the guilty here.... because... they say are desktop computers disks... but later... I have get a nice performance when copying for instance mailboxes from five to five.... I can flood a gigabit interface when copying mailboxes between servers from five to five.... they seem to perform.... > > Could anyone please shed us some light in this issue?. I don't really know what to think. > > Best regards, > > ATENCION: Este correo se ha enviado desde fuera de la organización. No pinche en los enlaces ni abra los adjuntos a no ser que reconozca el remitente y sepa que el contenido es seguro. --=_2094f57367545a643dbb74ca4f8ba24d Content-Type: multipart/related; boundary="=_6e3b5663f91f7c881feb9dbfb751600c" --=_6e3b5663f91f7c881feb9dbfb751600c Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset=UTF-8
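For reference, the ARC and metadata figures discussed above can be read directly with sysctl; a minimal check, using the stock FreeBSD ZFS counters that also appear later in this thread, shows whether the metadata portion is pressing against its limit:

    # total ARC size versus the configured maximum (both in bytes)
    sysctl kstat.zfs.misc.arcstats.size vfs.zfs.arc_max
    # metadata held in the ARC versus the metadata limit (both in bytes)
    sysctl kstat.zfs.misc.arcstats.arc_meta_used kstat.zfs.misc.arcstats.arc_meta_limit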

--=_6e3b5663f91f7c881feb9dbfb751600c Content-Transfer-Encoding: base64 Content-ID: <1649244525624d796dc0952013070694@ramattack.net> Content-Type: image/gif; name=d8974688.gif Content-Disposition: inline; filename=d8974688.gif; size=42 R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7 --=_6e3b5663f91f7c881feb9dbfb751600c-- --=_2094f57367545a643dbb74ca4f8ba24d-- From nobody Wed Apr 6 14:36:42 2022 X-Original-To: freebsd-performance@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 2BB861A9766C; Wed, 6 Apr 2022 14:36:48 +0000 (UTC) (envelope-from egoitz@ramattack.net) Received: from cu1208c.smtpx.saremail.com (cu1208c.smtpx.saremail.com [195.16.148.183]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 4KYRsp5Qf7z3hrB; Wed, 6 Apr 2022 14:36:45 +0000 (UTC) (envelope-from egoitz@ramattack.net) Received: from www.saremail.com (unknown [194.30.0.183]) by sieve-smtp-backend02.sarenet.es (Postfix) with ESMTPA id B81CB60C64B; Wed, 6 Apr 2022 16:36:42 +0200 (CEST) List-Id: Performance/tuning List-Archive: https://lists.freebsd.org/archives/freebsd-performance List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-performance@freebsd.org X-BeenThere: freebsd-performance@freebsd.org MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="=_99711a23deca8e31b3fd04d620ec3dd4" Date: Wed, 06 Apr 2022 16:36:42 +0200 From: egoitz@ramattack.net To: Rainer Duffner Cc: freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org, freebsd-performance@freebsd.org Subject: Re: Re: Desperate with 870 QVO and ZFS In-Reply-To: <665236B1-8F61-4B0E-BD9B-7B501B8BD617@ultra-secure.de> References: <4e98275152e23141eae40dbe7ba5571f@ramattack.net> <665236B1-8F61-4B0E-BD9B-7B501B8BD617@ultra-secure.de> Message-ID: <0ef282aee34b441f1991334e2edbcaec@ramattack.net> X-Sender: egoitz@ramattack.net User-Agent: Saremail webmail X-Rspamd-Queue-Id: 4KYRsp5Qf7z3hrB X-Spamd-Bar: --- Authentication-Results: mx1.freebsd.org; dkim=none; dmarc=pass (policy=reject) header.from=ramattack.net; spf=pass (mx1.freebsd.org: domain of egoitz@ramattack.net designates 195.16.148.183 as permitted sender) smtp.mailfrom=egoitz@ramattack.net X-Spamd-Result: default: False [-3.79 / 15.00]; RCVD_TLS_LAST(0.00)[]; RCVD_VIA_SMTP_AUTH(0.00)[]; XM_UA_NO_VERSION(0.01)[]; RCPT_COUNT_THREE(0.00)[4]; TO_DN_SOME(0.00)[]; R_SPF_ALLOW(-0.20)[+ip4:195.16.148.0/24]; NEURAL_HAM_LONG(-1.00)[-1.000]; MIME_GOOD(-0.10)[multipart/alternative,text/plain]; ARC_NA(0.00)[]; TO_MATCH_ENVRCPT_SOME(0.00)[]; NEURAL_HAM_SHORT(-1.00)[-1.000]; DMARC_POLICY_ALLOW(-0.50)[ramattack.net,reject]; FROM_NO_DN(0.00)[]; NEURAL_HAM_MEDIUM(-1.00)[-1.000]; MLMMJ_DEST(0.00)[freebsd-fs,freebsd-hackers,freebsd-performance]; FROM_EQ_ENVFROM(0.00)[]; R_DKIM_NA(0.00)[]; MIME_TRACE(0.00)[0:+,1:+,2:~]; ASN(0.00)[asn:3262, ipnet:195.16.128.0/19, country:ES]; RCVD_COUNT_TWO(0.00)[2]; MID_RHS_MATCH_FROM(0.00)[] X-ThisMailContainsUnwantedMimeParts: N --=_99711a23deca8e31b3fd04d620ec3dd4 Content-Transfer-Encoding: 8bit Content-Type: text/plain; charset=UTF-8 Hi Rainer! Thank you so much for your help :) :) Well I assume they are in a datacenter and should not be a power outage.... About dataset size... yes... our ones are big... they can be 3-4 TB easily each dataset..... 
We bought them because they are for mailboxes, and mailboxes grow and grow, so we needed the space to host them. We knew these disks had some speed issues, but we thought (as Samsung explains on the QVO site) that those issues only start after exceeding the speed buffer these disks have; as long as you do not exceed its capacity (the capacity of that buffer), no speed problem should arise. Perhaps we were wrong? Best regards, On 2022-04-06 14:56, Rainer Duffner wrote: >> On 06.04.2022 at 13:15, egoitz@ramattack.net wrote: >> I don't really know if, perhaps the QVO technology could be the guilty here.... because... they say are desktop computers disks... but later. > > Yeah, they are. > > Most likely, they don't have some sort of super-cap. > > A power-failure might totally toast the filesystem. > > These disks are - IMO - designed to accelerate read-operations. Their sustained write-performance is usually mediocre, at best. > > They might work well for small data-sets - because that is really written to some cache and the firmware just claims it's „written", but once the data-set becomes big enough, they are about as fast as a fast SATA-disk. > > https://www.tomshardware.com/reviews/samsung-970-evo-plus-ssd,5608.html --=_99711a23deca8e31b3fd04d620ec3dd4 Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset=UTF-8
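For what it's worth, the sustained-write behaviour described in the quoted reply can be watched directly while a slowdown is happening, with standard FreeBSD tools and nothing specific to this setup:

    # per-vdev operations and bandwidth, refreshed every second
    zpool iostat -v 1
    # per-disk busy percentage and per-operation latency as seen by GEOM
    gstat -p

If per-operation write latency climbs sharply while throughput stays flat, that would be consistent with the drives' write buffer being exhausted.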

--=_99711a23deca8e31b3fd04d620ec3dd4-- From nobody Wed Apr 6 15:30:49 2022 X-Original-To: freebsd-performance@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 92CF11A83BBB; Wed, 6 Apr 2022 15:30:52 +0000 (UTC) (envelope-from egoitz@ramattack.net) Received: from cu1208c.smtpx.saremail.com (cu1208c.smtpx.saremail.com [195.16.148.183]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 4KYT4C3mk8z3svy; Wed, 6 Apr 2022 15:30:51 +0000 (UTC) (envelope-from egoitz@ramattack.net) Received: from www.saremail.com (unknown [194.30.0.183]) by sieve-smtp-backend02.sarenet.es (Postfix) with ESMTPA id 6607560C676; Wed, 6 Apr 2022 17:30:49 +0200 (CEST) List-Id: Performance/tuning List-Archive: https://lists.freebsd.org/archives/freebsd-performance List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-performance@freebsd.org X-BeenThere: freebsd-performance@freebsd.org MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="=_aa8d0836ba0370d8efdf8758a7e3ea30" Date: Wed, 06 Apr 2022 17:30:49 +0200 From: egoitz@ramattack.net To: freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org, Freebsd performance Subject: Re: Desperate with 870 QVO and ZFS In-Reply-To: <0ef282aee34b441f1991334e2edbcaec@ramattack.net> References: <4e98275152e23141eae40dbe7ba5571f@ramattack.net> <665236B1-8F61-4B0E-BD9B-7B501B8BD617@ultra-secure.de> <0ef282aee34b441f1991334e2edbcaec@ramattack.net> Message-ID: <28e11d7ec0ac5dbea45f9f271fc28f06@ramattack.net> X-Sender: egoitz@ramattack.net User-Agent: Saremail webmail X-Rspamd-Queue-Id: 4KYT4C3mk8z3svy X-Spamd-Bar: --- Authentication-Results: mx1.freebsd.org; dkim=none; dmarc=pass (policy=reject) header.from=ramattack.net; spf=pass (mx1.freebsd.org: domain of egoitz@ramattack.net designates 195.16.148.183 as permitted sender) smtp.mailfrom=egoitz@ramattack.net X-Spamd-Result: default: False [-3.79 / 15.00]; RCVD_TLS_LAST(0.00)[]; RCVD_VIA_SMTP_AUTH(0.00)[]; XM_UA_NO_VERSION(0.01)[]; RCPT_COUNT_THREE(0.00)[3]; TO_DN_SOME(0.00)[]; TO_MATCH_ENVRCPT_ALL(0.00)[]; R_SPF_ALLOW(-0.20)[+ip4:195.16.148.0/24]; MIME_GOOD(-0.10)[multipart/alternative,text/plain]; ARC_NA(0.00)[]; NEURAL_HAM_LONG(-1.00)[-1.000]; NEURAL_HAM_MEDIUM(-1.00)[-1.000]; NEURAL_HAM_SHORT(-1.00)[-1.000]; DMARC_POLICY_ALLOW(-0.50)[ramattack.net,reject]; FROM_NO_DN(0.00)[]; MLMMJ_DEST(0.00)[freebsd-fs,freebsd-hackers,freebsd-performance]; FROM_EQ_ENVFROM(0.00)[]; R_DKIM_NA(0.00)[]; MIME_TRACE(0.00)[0:+,1:+,2:~]; ASN(0.00)[asn:3262, ipnet:195.16.128.0/19, country:ES]; RCVD_COUNT_TWO(0.00)[2]; MID_RHS_MATCH_FROM(0.00)[] X-ThisMailContainsUnwantedMimeParts: N --=_aa8d0836ba0370d8efdf8758a7e3ea30 Content-Transfer-Encoding: 8bit Content-Type: text/plain; charset=UTF-8 One perhaps important note!! When this happens... almost all processes appear in top in the following state: txg state or txg-> bio.... perhaps should the the vfs.zfs.dirty_data_max, vfs.zfs.txg.timeout, vfs.zfs.vdev.async_write_active_max_dirty_percent be increased, decreased.... I'm afraid of doing some chage ana finally ending up with an inestable server.... I'm not an expert in handling these values.... Any recommendation?. Best regards, El 2022-04-06 16:36, egoitz@ramattack.net escribió: > ATENCION: Este correo se ha enviado desde fuera de la organización. 
No pinche en los enlaces ni abra los adjuntos a no ser que reconozca el remitente y sepa que el contenido es seguro. > > Hi Rainer! > > Thank you so much for your help :) :) > > Well I assume they are in a datacenter and should not be a power outage.... > > About dataset size... yes... our ones are big... they can be 3-4 TB easily each dataset..... > > We bought them, because as they are for mailboxes and mailboxes grow and grow.... for having space for hosting them... > > We knew they had some speed issues, but those speed issues, we thought (as Samsung explains in the QVO site) they started after exceeding the speeding buffer this disks have. We though that meanwhile you didn't exceed it's capacity (the capacity of the speeding buffer) no speed problem arises. Perhaps we were wrong?. > > Best regards, > > El 2022-04-06 14:56, Rainer Duffner escribió: > > Am 06.04.2022 um 13:15 schrieb egoitz@ramattack.net: > I don't really know if, perhaps the QVO technology could be the guilty here.... because... they say are desktop computers disks... but later. > > Yeah, they are. > > Most likely, they don't have some sort of super-cap. > > A power-failure might totally toast the filesystem. > > These disks are - IMO - designed to accelerate read-operations. Their sustained write-performance is usually mediocre, at best. > > They might work well for small data-sets - because that is really written to some cache and the firmware just claims it's „written", but once the data-set becomes big enough, they are about as fast as a fast SATA-disk. > > https://www.tomshardware.com/reviews/samsung-970-evo-plus-ssd,5608.html --=_aa8d0836ba0370d8efdf8758a7e3ea30 Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset=UTF-8
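The tunables mentioned above are ordinary sysctls, so they can at least be inspected (and, if desired, changed at runtime) before committing anything to loader.conf. The sketch below only reads them and shows the shape of a runtime change; the example value is purely illustrative, not a recommendation from this thread:

    # current values of the write-throttle related tunables
    sysctl vfs.zfs.dirty_data_max vfs.zfs.txg.timeout \
        vfs.zfs.vdev.async_write_active_max_dirty_percent
    # shape of a runtime change (illustrative value only, left commented out)
    # sysctl vfs.zfs.dirty_data_max=8589934592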

--=_aa8d0836ba0370d8efdf8758a7e3ea30-- From nobody Wed Apr 6 15:43:27 2022 X-Original-To: freebsd-performance@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 0B3581A873E2; Wed, 6 Apr 2022 15:43:37 +0000 (UTC) (envelope-from se@FreeBSD.org) Received: from smtp.freebsd.org (smtp.freebsd.org [IPv6:2610:1c1:1:606c::24b:4]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256 client-signature RSA-PSS (4096 bits) client-digest SHA256) (Client CN "smtp.freebsd.org", Issuer "R3" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4KYTLw164tz4Rt7; Wed, 6 Apr 2022 15:43:36 +0000 (UTC) (envelope-from se@FreeBSD.org) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=freebsd.org; s=dkim; t=1649259816; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=DszpBHls3d9k0db5g8HTfayx6XUbGeuJVLcEKqu7X00=; b=Gc5xuom5FY9xOpw9pbd1F6xAo9h6fFjjX6ZZa0srj87yA97Uu0xMLJwlwn29DiJqmS9lUW 6rBc9B/LUBjAL9JHluoyLecZG7WPTnLnLEPPGAvYvf3YpF6hdPZtLdyJnIacTrcTWqZecc jGl7MrQSGn3OkOQyrus9vri9gEPhUe698IJAZIKrSzcwCtA9Q78KvAXoAIRSu88kvfGct8 BeknFduPoU5/w4PknQGlUiHqCOszLY2/XqhVi2EZFXmUCMgfXRKj8tWsSxjC1cu0WihEtp 1PpW9FVAy00CMNWFZgSnHOoRBHw3G/TVcikzZf6ozAsF+20TZoWPEDas69Z0Lw== Received: from [IPV6:2003:cd:5f22:6f00:a5c2:9043:ac1c:2c06] (p200300cd5f226f00a5c29043ac1c2c06.dip0.t-ipconnect.de [IPv6:2003:cd:5f22:6f00:a5c2:9043:ac1c:2c06]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client did not present a certificate) (Authenticated sender: se/mail) by smtp.freebsd.org (Postfix) with ESMTPSA id C2DF02234; Wed, 6 Apr 2022 15:43:32 +0000 (UTC) (envelope-from se@FreeBSD.org) Message-ID: Date: Wed, 6 Apr 2022 17:43:27 +0200 List-Id: Performance/tuning List-Archive: https://lists.freebsd.org/archives/freebsd-performance List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-performance@freebsd.org X-BeenThere: freebsd-performance@freebsd.org MIME-Version: 1.0 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:91.0) Gecko/20100101 Thunderbird/91.7.0 Subject: Re: Desperate with 870 QVO and ZFS Content-Language: en-US To: egoitz@ramattack.net Cc: freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org, freebsd-performance@freebsd.org, Rainer Duffner References: <4e98275152e23141eae40dbe7ba5571f@ramattack.net> <665236B1-8F61-4B0E-BD9B-7B501B8BD617@ultra-secure.de> <0ef282aee34b441f1991334e2edbcaec@ramattack.net> From: Stefan Esser In-Reply-To: <0ef282aee34b441f1991334e2edbcaec@ramattack.net> Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="------------xWEDst9IZgYvLPukN2ga6r9b" ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=freebsd.org; s=dkim; t=1649259816; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=DszpBHls3d9k0db5g8HTfayx6XUbGeuJVLcEKqu7X00=; b=aN4rR0ArnFTHAcbxMwgavfJPkS5+OKQlPfEt1Et1uxmcF+JlZXt0XA3/T7la/p1horP/zj WOseXuHjUAkNWnLPN2aeh5cq+TTuB3/9XrhmOF3NDjuj59dF5ikvm0pvAhqZ6GfbWNxFTD zIPeu0BwVtRLTujVxtYvsWhk9syiPIrVfmjaL40GxcnIuwocG2gMs0ygkn9P5okXCX+ZTy ztX0bnRWvrPKSQjztLOY6TiTSagOw8H6L9InjiPlXdF3TsJxxuK+LRecN9+90HbGQRHzxb 
8lyj0IQtmdFmSoZI7o0sHk0TkiBus9KSDJ1ITSaYRKyPkCiw3BZ9TQDw2YHISQ== ARC-Seal: i=1; s=dkim; d=freebsd.org; t=1649259816; a=rsa-sha256; cv=none; b=Kh0/OSy2P0jg8Ud2tGvmRxYZEOwkXWAVZpjm83vZci17DkOsyBf8dW34XMZ5BukYz2EVBV ogoweL4BZh0pL34FGqJ8cL0/CioGC8N/qydaux+MobxVFKyvDfX8fTOlR6QOclh225o63Z ubUAOVUytF5H+3ku//KvSceootMPYY1/UOOjPImCZsAjTiWXyz54Y2gKSyfAF0J7LG4bf0 fCiWAbNWnTWYlEfcUGE3SRnKQrPk+htaTT2APKWkRX5jGvDxlO3EYzHcQU7ie+qjRv8EuX 4kAmnhFgsgOiYSoCB2OSqti+/Y3ux1IFIZ5OV8+vv0n1muQusDn97SPRphb9Mw== ARC-Authentication-Results: i=1; mx1.freebsd.org; none X-ThisMailContainsUnwantedMimeParts: N This is an OpenPGP/MIME signed message (RFC 4880 and 3156) --------------xWEDst9IZgYvLPukN2ga6r9b Content-Type: multipart/mixed; boundary="------------gUL8upCKsfSinXvRen5NU9ab"; protected-headers="v1" From: Stefan Esser To: egoitz@ramattack.net Cc: freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org, freebsd-performance@freebsd.org, Rainer Duffner Message-ID: Subject: Re: Desperate with 870 QVO and ZFS References: <4e98275152e23141eae40dbe7ba5571f@ramattack.net> <665236B1-8F61-4B0E-BD9B-7B501B8BD617@ultra-secure.de> <0ef282aee34b441f1991334e2edbcaec@ramattack.net> In-Reply-To: <0ef282aee34b441f1991334e2edbcaec@ramattack.net> --------------gUL8upCKsfSinXvRen5NU9ab Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable On 06.04.22 at 16:36, egoitz@ramattack.net wrote: > Hi Rainer! > > Thank you so much for your help :) :) > > Well I assume they are in a datacenter and should not be a power outage.... > > About dataset size... yes... our ones are big... they can be 3-4 TB easily each > dataset..... > > We bought them, because as they are for mailboxes and mailboxes grow and > grow.... for having space for hosting them... Which mailbox format (e.g. mbox, maildir, ...) do you use? > We knew they had some speed issues, but those speed issues, we thought (as > Samsung explains in the QVO site) they started after exceeding the speeding > buffer this disks have. We though that meanwhile you didn't exceed it's > capacity (the capacity of the speeding buffer) no speed problem arises. Perhaps > we were wrong?. These drives are meant for small loads in a typical PC use case, i.e. some installations of software in the few GB range, else only files of a few MB being written, perhaps an import of media files that range from tens to a few hundred MB at a time, but less often than once a day. As the SSD fills, the space available for the single level write cache gets smaller (on many SSDs, I have no numbers for this particular device), and thus the amount of data that can be written at single cell speed shrinks as the SSD gets full. I have just looked up the size of the SLC cache, it is specified to be 78 GB for the empty SSD, 6 GB when it is full (for the 2 TB version, smaller models will have a smaller SLC cache). But after writing those few GB at a speed of some 500 MB/s (i.e. after 12 to 150 seconds), the drive will need several minutes to transfer those writes to the quad-level cells, and will operate at a fraction of the nominal performance during that time. (QLC writes max out at 80 MB/s for the 1 TB model, 160 MB/s for the 2 TB model.) And cheap SSDs often have no RAM cache (not checked, but I'd be surprised if the QVO had one) and thus cannot keep bookkeeping data in such a cache, further limiting the performance under load. And the resilience (max.
amount of data written over its lifetime) is also quite low - I hope those drives are used in some kind of RAID configuration. The 870 QVO is specified for 370 full capacity writes, i.e. 370 TB for the 1 TB model. That's still a few hundred GB a day - but only if the write amplification stays in a reasonable range ... --------------gUL8upCKsfSinXvRen5NU9ab-- --------------xWEDst9IZgYvLPukN2ga6r9b Content-Type: application/pgp-signature; name="OpenPGP_signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="OpenPGP_signature" -----BEGIN PGP SIGNATURE----- wsB5BAABCAAjFiEEo3HqZZwL7MgrcVMTR+u171r99UQFAmJNtR8FAwAAAAAACgkQR+u171r99UQX jggAh1PLi41CMsG6xbRvf9KA3JRSYjHGSCr3soAi5Su5VZmVNts3ocVUONOfR4yoTj/JGZ0HYvwi iQm4PxPLS7Fj69joQnernx6Dhem6yg8hJSwrU3HDZQ4lIDSQ2B220+uz9MrqImu21JvDxIRzmgH+ kQ7Q3+ZoxCi0BJX83yL8sh0wMA5tLrV1e8IKrpBR/mLiwQZRoaOPKXKx29eP4Q8St57UySGfGL13 O33jTgM8sAAGkImgAa2JzXRhYQ2KY5QnplYv1cxk6Zbpuq1TgovqGm3pzak0i1kTiK95K3tWYhMh 2ne2gmnXcO6CGu+QeUapXJN10sefE8gRNrBDLzpjJg== =HKc2 -----END PGP SIGNATURE----- --------------xWEDst9IZgYvLPukN2ga6r9b-- From nobody Wed Apr 6 15:48:31 2022 X-Original-To: freebsd-performance@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id AEEFF1A8A975; Wed, 6 Apr 2022 15:56:47 +0000 (UTC) (envelope-from egoitz@ramattack.net) Received: from cu01208b.smtpx.saremail.com (cu01208b.smtpx.saremail.com [195.16.151.183]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 4KYTf621Wdz4W0C; Wed, 6 Apr 2022 15:56:44 +0000 (UTC) (envelope-from egoitz@ramattack.net) Received: from www.saremail.com (unknown [194.30.0.183]) by sieve-smtp-backend01.sarenet.es (Postfix) with ESMTPA id 1F6F860C6A8; Wed, 6 Apr 2022 17:48:31 +0200 (CEST) List-Id: Performance/tuning List-Archive: https://lists.freebsd.org/archives/freebsd-performance List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-performance@freebsd.org X-BeenThere: freebsd-performance@freebsd.org MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="=_a4ce21118f18db79ad9328c5f2cebb5a" Date: Wed, 06 Apr 2022 17:48:31 +0200 From: egoitz@ramattack.net To: freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org, Freebsd performance Cc: owner-freebsd-fs@freebsd.org Subject: Re: Re: Desperate with 870 QVO and ZFS In-Reply-To: <28e11d7ec0ac5dbea45f9f271fc28f06@ramattack.net> References: <4e98275152e23141eae40dbe7ba5571f@ramattack.net> <665236B1-8F61-4B0E-BD9B-7B501B8BD617@ultra-secure.de> <0ef282aee34b441f1991334e2edbcaec@ramattack.net> <28e11d7ec0ac5dbea45f9f271fc28f06@ramattack.net> Message-ID: <29f0eee5b502758126bf4cfa2d8e3517@ramattack.net> X-Sender: egoitz@ramattack.net User-Agent: Saremail webmail X-Rspamd-Queue-Id: 4KYTf621Wdz4W0C X-Spamd-Bar: --- Authentication-Results: mx1.freebsd.org; dkim=none; dmarc=pass (policy=reject) header.from=ramattack.net; spf=pass (mx1.freebsd.org: domain of egoitz@ramattack.net designates 195.16.151.183 as permitted sender) smtp.mailfrom=egoitz@ramattack.net X-Spamd-Result: default: False [-3.79 / 15.00]; RCVD_TLS_LAST(0.00)[]; RCVD_VIA_SMTP_AUTH(0.00)[]; XM_UA_NO_VERSION(0.01)[]; RCPT_COUNT_THREE(0.00)[4]; TO_DN_SOME(0.00)[]; R_SPF_ALLOW(-0.20)[+ip4:195.16.151.0/24]; NEURAL_HAM_LONG(-1.00)[-1.000]; 
MIME_GOOD(-0.10)[multipart/alternative,text/plain]; ARC_NA(0.00)[]; TO_MATCH_ENVRCPT_SOME(0.00)[]; NEURAL_HAM_SHORT(-1.00)[-1.000]; DMARC_POLICY_ALLOW(-0.50)[ramattack.net,reject]; FROM_NO_DN(0.00)[]; NEURAL_HAM_MEDIUM(-1.00)[-1.000]; MLMMJ_DEST(0.00)[freebsd-fs,freebsd-hackers,freebsd-performance]; FROM_EQ_ENVFROM(0.00)[]; R_DKIM_NA(0.00)[]; MIME_TRACE(0.00)[0:+,1:+,2:~]; ASN(0.00)[asn:3262, ipnet:195.16.128.0/19, country:ES]; RCVD_COUNT_TWO(0.00)[2]; MID_RHS_MATCH_FROM(0.00)[] X-ThisMailContainsUnwantedMimeParts: N --=_a4ce21118f18db79ad9328c5f2cebb5a Content-Transfer-Encoding: 8bit Content-Type: text/plain; charset=UTF-8 I have been thinking and.... I got the following tunables now : vfs.zfs.arc_meta_strategy: 0 vfs.zfs.arc_meta_limit: 17179869184 kstat.zfs.misc.arcstats.arc_meta_min: 4294967296 kstat.zfs.misc.arcstats.arc_meta_max: 19386809344 kstat.zfs.misc.arcstats.arc_meta_limit: 17179869184 kstat.zfs.misc.arcstats.arc_meta_used: 16870668480 vfs.zfs.arc_max: 68719476736 and top sais : ARC: 19G Total, 1505M MFU, 12G MRU, 6519K Anon, 175M Header, 5687M Other When using even 128GB of vfs.zfs.arc_max (instead of 64GB I have now set) the ARC wasn't approximating to it's max usable size.... Can perhaps that could have something to do with that fact that arc meta values are almost at the limit set?. Perhaps increasing vfs.zfs.arc_meta_limit or kstat.zfs.misc.arcstats.arc_meta_limit (I suppose the first one is the one to increase) could cause a better performance and perhaps a better usage and better take advantage of having 64GB max of ARC set?. I say it because now it doesn't use more than 19GB in total ARC memory.... As always said, any opinion or idea would be very highly appreciated. Cheers, El 2022-04-06 17:30, egoitz@ramattack.net escribió: > ATENCION: Este correo se ha enviado desde fuera de la organización. No pinche en los enlaces ni abra los adjuntos a no ser que reconozca el remitente y sepa que el contenido es seguro. > > One perhaps important note!! > > When this happens... almost all processes appear in top in the following state: > > txg state or > > txg-> > > bio.... > > perhaps should the the vfs.zfs.dirty_data_max, vfs.zfs.txg.timeout, vfs.zfs.vdev.async_write_active_max_dirty_percent be increased, decreased.... I'm afraid of doing some chage ana finally ending up with an inestable server.... I'm not an expert in handling these values.... > > Any recommendation?. > > Best regards, > > El 2022-04-06 16:36, egoitz@ramattack.net escribió: > > ATENCION: Este correo se ha enviado desde fuera de la organización. No pinche en los enlaces ni abra los adjuntos a no ser que reconozca el remitente y sepa que el contenido es seguro. > > Hi Rainer! > > Thank you so much for your help :) :) > > Well I assume they are in a datacenter and should not be a power outage.... > > About dataset size... yes... our ones are big... they can be 3-4 TB easily each dataset..... > > We bought them, because as they are for mailboxes and mailboxes grow and grow.... for having space for hosting them... > > We knew they had some speed issues, but those speed issues, we thought (as Samsung explains in the QVO site) they started after exceeding the speeding buffer this disks have. We though that meanwhile you didn't exceed it's capacity (the capacity of the speeding buffer) no speed problem arises. Perhaps we were wrong?. 
> > Best regards, > > El 2022-04-06 14:56, Rainer Duffner escribió: > > Am 06.04.2022 um 13:15 schrieb egoitz@ramattack.net: > I don't really know if, perhaps the QVO technology could be the guilty here.... because... they say are desktop computers disks... but later. > > Yeah, they are. > > Most likely, they don't have some sort of super-cap. > > A power-failure might totally toast the filesystem. > > These disks are - IMO - designed to accelerate read-operations. Their sustained write-performance is usually mediocre, at best. > > They might work well for small data-sets - because that is really written to some cache and the firmware just claims it's „written", but once the data-set becomes big enough, they are about as fast as a fast SATA-disk. > > https://www.tomshardware.com/reviews/samsung-970-evo-plus-ssd,5608.html --=_a4ce21118f18db79ad9328c5f2cebb5a Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset=UTF-8
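For readers trying to reproduce the ARC experiment described above: the metadata limit and its current consumption can be inspected, and on recent FreeBSD/OpenZFS versions adjusted at runtime, roughly as sketched below. The 32 GB value is only an example, and whether vfs.zfs.arc_meta_limit is writable without a reboot depends on the ZFS version in use.

# Compare the configured metadata limit with what the ARC actually caches (bytes).
sysctl vfs.zfs.arc_meta_limit kstat.zfs.misc.arcstats.arc_meta_used

# Experimentally raise the limit to 32 GB; if the sysctl is read-only on this
# release, set it in /boot/loader.conf and reboot instead.
sysctl vfs.zfs.arc_meta_limit=34359738368
echo 'vfs.zfs.arc_meta_limit="34359738368"' >> /boot/loader.conf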

--=_a4ce21118f18db79ad9328c5f2cebb5a-- From nobody Wed Apr 6 16:34:56 2022 X-Original-To: freebsd-performance@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 94E3B1A93BAB; Wed, 6 Apr 2022 16:35:01 +0000 (UTC) (envelope-from egoitz@ramattack.net) Received: from cu1208c.smtpx.saremail.com (cu1208c.smtpx.saremail.com [195.16.148.183]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 4KYVVC4tX0z4ftP; Wed, 6 Apr 2022 16:34:59 +0000 (UTC) (envelope-from egoitz@ramattack.net) Received: from www.saremail.com (unknown [194.30.0.183]) by sieve-smtp-backend02.sarenet.es (Postfix) with ESMTPA id 4B1C660C60B; Wed, 6 Apr 2022 18:34:56 +0200 (CEST) List-Id: Performance/tuning List-Archive: https://lists.freebsd.org/archives/freebsd-performance List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-performance@freebsd.org X-BeenThere: freebsd-performance@freebsd.org MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="=_9e84ea9eb28b05e81541398ce76d2803" Date: Wed, 06 Apr 2022 18:34:56 +0200 From: egoitz@ramattack.net To: Stefan Esser Cc: freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org, freebsd-performance@freebsd.org, Rainer Duffner Subject: Re: {* 05.00 *}Re: Desperate with 870 QVO and ZFS In-Reply-To: References: <4e98275152e23141eae40dbe7ba5571f@ramattack.net> <665236B1-8F61-4B0E-BD9B-7B501B8BD617@ultra-secure.de> <0ef282aee34b441f1991334e2edbcaec@ramattack.net> Message-ID: X-Sender: egoitz@ramattack.net User-Agent: Saremail webmail X-Rspamd-Queue-Id: 4KYVVC4tX0z4ftP X-Spamd-Bar: --- Authentication-Results: mx1.freebsd.org; dkim=none; dmarc=pass (policy=reject) header.from=ramattack.net; spf=pass (mx1.freebsd.org: domain of egoitz@ramattack.net designates 195.16.148.183 as permitted sender) smtp.mailfrom=egoitz@ramattack.net X-Spamd-Result: default: False [-3.79 / 15.00]; RCVD_TLS_LAST(0.00)[]; RCVD_VIA_SMTP_AUTH(0.00)[]; XM_UA_NO_VERSION(0.01)[]; TO_DN_SOME(0.00)[]; R_SPF_ALLOW(-0.20)[+ip4:195.16.148.0/24]; NEURAL_HAM_LONG(-1.00)[-1.000]; MIME_GOOD(-0.10)[multipart/alternative,text/plain]; ARC_NA(0.00)[]; RCPT_COUNT_FIVE(0.00)[5]; TO_MATCH_ENVRCPT_SOME(0.00)[]; NEURAL_HAM_SHORT(-1.00)[-1.000]; DMARC_POLICY_ALLOW(-0.50)[ramattack.net,reject]; FROM_NO_DN(0.00)[]; NEURAL_HAM_MEDIUM(-1.00)[-1.000]; MLMMJ_DEST(0.00)[freebsd-fs,freebsd-hackers,freebsd-performance]; FROM_EQ_ENVFROM(0.00)[]; R_DKIM_NA(0.00)[]; MIME_TRACE(0.00)[0:+,1:+,2:~]; ASN(0.00)[asn:3262, ipnet:195.16.128.0/19, country:ES]; RCVD_COUNT_TWO(0.00)[2]; MID_RHS_MATCH_FROM(0.00)[] X-ThisMailContainsUnwantedMimeParts: N --=_9e84ea9eb28b05e81541398ce76d2803 Content-Transfer-Encoding: 8bit Content-Type: text/plain; charset=UTF-8 Hi Stefan! Thank you so much for your answer!!. I do answer below in green bold for instance... for a better distinction.... Very thankful for all your comments Stefan!!! :) :) :) Cheers!! El 2022-04-06 17:43, Stefan Esser escribió: > ATENCION > ATENCION > ATENCION!!! Este correo se ha enviado desde fuera de la organizacion. No pinche en los enlaces ni abra los adjuntos a no ser que reconozca el remitente y sepa que el contenido es seguro. > > Am 06.04.22 um 16:36 schrieb egoitz@ramattack.net: > >> Hi Rainer! 
>> >> Thank you so much for your help :) :) >> >> Well I assume they are in a datacenter and should not be a power outage.... >> >> About dataset size... yes... our ones are big... they can be 3-4 TB easily each >> dataset..... >> >> We bought them, because as they are for mailboxes and mailboxes grow and >> grow.... for having space for hosting them... > > Which mailbox format (e.g. mbox, maildir, ...) do you use? > > I'M RUNNING CYRUS IMAP SO SORT OF MAILDIR... TOO MANY LITTLE FILES NORMALLY..... SOMETIMES DIRECTORIES WITH TONS OF LITTLE FILES.... > >> We knew they had some speed issues, but those speed issues, we thought (as >> Samsung explains in the QVO site) they started after exceeding the speeding >> buffer this disks have. We though that meanwhile you didn't exceed it's >> capacity (the capacity of the speeding buffer) no speed problem arises. Perhaps >> we were wrong?. > > These drives are meant for small loads in a typical PC use case, > i.e. some installations of software in the few GB range, else only > files of a few MB being written, perhaps an import of media files > that range from tens to a few hundred MB at a time, but less often > than once a day. > > WE MOVE, YOU KNOW... LOTS OF LITTLE FILES... AND LOT'S OF DIFFERENT CONCURRENT MODIFICATIONS BY 1500-2000 CONCURRENT IMAP CONNECTIONS WE HAVE... > > As the SSD fills, the space available for the single level write > cache gets smaller > > THE SINGLE LEVEL WRITE CACHE IS THE CACHE THESE SSD DRIVERS HAVE, FOR COMPENSATING THE SPEED ISSUES THEY HAVE DUE TO USING QLC MEMORY?. DO YOU REFER TO THAT?. SORRY I DON'T UNDERSTAND WELL THIS PARAGRAPH. > > (on many SSDs, I have no numbers for this > particular device), and thus the amount of data that can be > written at single cell speed shrinks as the SSD gets full. > > I have just looked up the size of the SLC cache, it is specified > to be 78 GB for the empty SSD, 6 GB when it is full (for the 2 TB > version, smaller models will have a smaller SLC cache). > > ASSUMING YOU WERE TALKING ABOUT THE CACHE FOR COMPENSATING SPEED WE PREVIOUSLY COMMENTED, I SHOULD SAY THESE ARE THE 870 QVO BUT THE 8TB VERSION. SO THEY SHOULD HAVE THE BIGGEST CACHE FOR COMPENSATING THE SPEED ISSUES... > > But after writing those few GB at a speed of some 500 MB/s (i.e. > after 12 to 150 seconds), the drive will need several minutes to > transfer those writes to the quad-level cells, and will operate > at a fraction of the nominal performance during that time. > (QLC writes max out at 80 MB/s for the 1 TB model, 160 MB/s for the > 2 TB model.) > > WELL WE ARE IN THE 8TB MODEL. I THINK I HAVE UNDERSTOOD WHAT YOU WROTE IN PREVIOUS PARAGRAPH. YOU SAID THEY CAN BE FAST BUT NOT CONSTANTLY, BECAUSE LATER THEY HAVE TO WRITE ALL THAT TO THEIR PERPETUAL STORAGE FROM THE CACHE. AND THAT'S SLOW. AM I WRONG?. EVEN IN THE 8TB MODEL YOU THINK STEFAN?. > > THE MAIN PROBLEM WE ARE FACING IS THAT IN SOME PEAK MOMENTS, WHEN THE MACHINE SERVES CONNECTIONS FOR ALL THE INSTANCES IT HAS, AND ONLY AS SAID IN SOME PEAK MOMENTS... LIKE THE 09AM OR THE 11AM.... IT SEEMS THE MACHINE BECOMES SLOWER... AND LIKE IF THE DISKS WEREN'T ABLE TO SERVE ALL THEY HAVE TO SERVE.... IN THESE MOMENTS, NO BIG FILES ARE MOVED... BUT AS WE HAVE 1800-2000 CONCURRENT IMAP CONNECTIONS... NORMALLY THEY ARE DOING EACH ONE... LITTLE CHANGES IN THEIR MAILBOX. 
DO YOU THINK PERHAPS THIS DISKS THEN ARE NOT APPROPRIATE FOR THIS KIND OF USAGE?- > > And cheap SSDs often have no RAM cache (not checked, but I'd be > surprised if the QVO had one) and thus cannot keep bookkeeping date > in such a cache, further limiting the performance under load. > > THIS BROCHURE (HTTPS://SEMICONDUCTOR.SAMSUNG.COM/RESOURCES/BROCHURE/870_SERIES_BROCHURE.PDF AND THE DATASHEET HTTPS://SEMICONDUCTOR.SAMSUNG.COM/RESOURCES/DATA-SHEET/SAMSUNG_SSD_870_QVO_DATA_SHEET_REV1.1.PDF) SAIS IF I HAVE READ PROPERLY, THE 8TB DRIVE HAS 8GB OF RAM?. I ASSUME THAT IS WHAT THEY CALL THE TURBO WRITE CACHE?. > > And the resilience (max. amount of data written over its lifetime) > is also quite low - I hope those drives are used in some kind of > RAID configuration. > > YEP WE USE RAIDZ-2 > > The 870 QVO is specified for 370 full capacity > writes, i.e. 370 TB for the 1 TB model. That's still a few hundred > GB a day - but only if the write amplification stays in a reasonable > range ... > > WELL YES... 2880TB IN OUR CASE....NOT BAD.. ISN'T IT? --=_9e84ea9eb28b05e81541398ce76d2803 Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset=UTF-8
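A cheap way to see how hard the IMAP workload actually hits the drives' endurance budget, and how much write amplification is going on, is to watch the SMART counters over time. This assumes sysutils/smartmontools is installed; the device name and the attribute names below are examples and differ between SSD models and firmware revisions.

# Host writes and wear level as reported by the drive; Samsung SATA SSDs
# typically expose Total_LBAs_Written (multiply by 512 for bytes) and
# Wear_Leveling_Count.
smartctl -A /dev/ada0 | egrep 'Total_LBAs_Written|Wear_Leveling_Count'

# Sampling this once a day and comparing the daily delta against the rated
# TBW of the drive shows how many years of endurance the current load leaves.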

--=_9e84ea9eb28b05e81541398ce76d2803-- From nobody Wed Apr 6 16:51:49 2022 X-Original-To: freebsd-performance@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 437DE1A983A4; Wed, 6 Apr 2022 16:51:55 +0000 (UTC) (envelope-from egoitz@ramattack.net) Received: from cu01208b.smtpx.saremail.com (cu01208b.smtpx.saremail.com [195.16.151.183]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 4KYVsk0NY1z4lVF; Wed, 6 Apr 2022 16:51:52 +0000 (UTC) (envelope-from egoitz@ramattack.net) Received: from www.saremail.com (unknown [194.30.0.183]) by sieve-smtp-backend01.sarenet.es (Postfix) with ESMTPA id 027BD60C6BF; Wed, 6 Apr 2022 18:51:49 +0200 (CEST) List-Id: Performance/tuning List-Archive: https://lists.freebsd.org/archives/freebsd-performance List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-performance@freebsd.org X-BeenThere: freebsd-performance@freebsd.org MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="=_be6199b7d5e868a9b11faca98e7631c5" Date: Wed, 06 Apr 2022 18:51:49 +0200 From: egoitz@ramattack.net To: Eugene Grosbein Cc: freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org, Freebsd performance Subject: Re: {* 05.00 *}Re: Desperate with 870 QVO and ZFS In-Reply-To: References: <4e98275152e23141eae40dbe7ba5571f@ramattack.net> <665236B1-8F61-4B0E-BD9B-7B501B8BD617@ultra-secure.de> <0ef282aee34b441f1991334e2edbcaec@ramattack.net> <28e11d7ec0ac5dbea45f9f271fc28f06@ramattack.net> Message-ID: <7aa95cb4bf1fd38b3fce93bc26826042@ramattack.net> X-Sender: egoitz@ramattack.net User-Agent: Saremail webmail X-Rspamd-Queue-Id: 4KYVsk0NY1z4lVF X-Spamd-Bar: --- Authentication-Results: mx1.freebsd.org; dkim=none; dmarc=pass (policy=reject) header.from=ramattack.net; spf=pass (mx1.freebsd.org: domain of egoitz@ramattack.net designates 195.16.151.183 as permitted sender) smtp.mailfrom=egoitz@ramattack.net X-Spamd-Result: default: False [-3.79 / 15.00]; RCVD_TLS_LAST(0.00)[]; RCVD_VIA_SMTP_AUTH(0.00)[]; XM_UA_NO_VERSION(0.01)[]; RCPT_COUNT_THREE(0.00)[4]; TO_DN_SOME(0.00)[]; R_SPF_ALLOW(-0.20)[+ip4:195.16.151.0/24]; NEURAL_HAM_LONG(-1.00)[-1.000]; MIME_GOOD(-0.10)[multipart/alternative,text/plain]; ARC_NA(0.00)[]; TO_MATCH_ENVRCPT_SOME(0.00)[]; NEURAL_HAM_SHORT(-1.00)[-1.000]; DMARC_POLICY_ALLOW(-0.50)[ramattack.net,reject]; FROM_NO_DN(0.00)[]; NEURAL_HAM_MEDIUM(-1.00)[-1.000]; MLMMJ_DEST(0.00)[freebsd-fs,freebsd-hackers,freebsd-performance]; FROM_EQ_ENVFROM(0.00)[]; R_DKIM_NA(0.00)[]; MIME_TRACE(0.00)[0:+,1:+,2:~]; ASN(0.00)[asn:3262, ipnet:195.16.128.0/19, country:ES]; RCVD_COUNT_TWO(0.00)[2]; MID_RHS_MATCH_FROM(0.00)[] X-ThisMailContainsUnwantedMimeParts: N --=_be6199b7d5e868a9b11faca98e7631c5 Content-Transfer-Encoding: 8bit Content-Type: text/plain; charset=UTF-8 Hi Eugene!!! Thank you so much really again mate :) :) :) About your recommendations... Eugene, if some of them wouldn't be working as expected, could we revert some or all of them or perhaps some of your recommendations below need to be definitive?. I do answer below in green bold for better distinction :) :) El 2022-04-06 18:14, Eugene Grosbein escribió: > ATENCION > ATENCION > ATENCION!!! Este correo se ha enviado desde fuera de la organizacion. 
No pinche en los enlaces ni abra los adjuntos a no ser que reconozca el remitente y sepa que el contenido es seguro. > > 06.04.2022 22:30, egoitz@ramattack.net пишет: > >> One perhaps important note!! >> >> When this happens... almost all processes appear in top in the following state: >> >> txg state or >> >> txg-> >> >> bio.... >> >> perhaps should the the vfs.zfs.dirty_data_max, vfs.zfs.txg.timeout, vfs.zfs.vdev.async_write_active_max_dirty_percent be increased, decreased.... I'm afraid of doing some chage ana finally ending up with an inestable server.... I'm not an expert in handling these values.... >> >> Any recommendation?. > > 1) Make sure the pool has enough free space because ZFS can became crawling slow otherwise. > > THIS IS JUST AN EXAMPLE... BUT YOU CAN SEE ALL SIMILARLY.... > > ZPOOL LIST > NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT > ZROOT 448G 2.27G 446G - - 1% 0% 1.00X ONLINE - > MAIL_DATASET 58.2T 19.4T 38.8T - - 32% 33% 1.00X ONLINE - > > 2) Increase recordsize upto 1MB for file systems located in the pool > so ZFS is allowed to use bigger request sizes for read/write operations > > WE HAVE THE DEFAULT... SO 128K... > > 3) If you use compression, look if achieved compressratio worth it and > if not (<1.4 f.e.) then better disable compression to avoid its overhead; > > WE DON'T USE COMPRESSION AS IT'S NOT SET BY DEFAULT. SOME PEOPLE SAY YOU SHOULD HAVE IT ENABLED.... BUT.... JUST FOR AVOID HAVING SOME DATA COMPRESSED SOME OTHER NOT (IN CASE YOU ENABLE AND LATER DISABLE) AND FINALLY FOR AVOID ACCESSING TO INFORMATION WITH DIFFERENT CPU COSTS OF HANDLING... WE HAVE NOT TOUCHED COMPRESSION.... > > WE SHOULD SAY WE HAVE LOTS OF CPU... > > 4) try "zfs set redundant_metadata=most" to decrease amount of small writes to the file systems; > > OK.... > > 5) If you have good power supply and stable (non-crashing) OS, try increasing > sysctl vfs.zfs.txg.timeout from defaule 5sec, but do not be extreme (f.e. upto 10sec). > Maybe it will increase amount of long writes and decrease amount of short writes, that is good. > > WELL I HAVE SYNC IN DISABLED IN THE DATASETS... DO YOU STILL THINK IT'S GOOD TO CHANGE IT?. JUST A QUESTION OF PERSON WANTING TO LEARN :) . > > WHAT ABOUT THE VFS.ZFS.DIRTY_DATA_MAX AND THE VFS.ZFS.DIRTY_DATA_MAX_MAX, WOULD YOU INCREASE THEM FROM 4GB IT'S SET NOW?. > > THANKS A LOT EUGENE!!!! > CHEERS!! --=_be6199b7d5e868a9b11faca98e7631c5 Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset=UTF-8
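For reference, recommendations 2) to 5) above translate into the commands below. The pool name is taken from the zpool list output, the child dataset name is just a placeholder, and every property change can be reverted later with zfs inherit (data already written with the old recordsize keeps its old block size, though).

zfs set recordsize=1M mail_dataset/cyrus       # only affects newly written files
zfs set compression=lz4 mail_dataset/cyrus     # verify later with: zfs get compressratio
zfs set redundant_metadata=most mail_dataset/cyrus
sysctl vfs.zfs.txg.timeout=10                  # runtime change; add to /etc/sysctl.conf to persist

# Example of reverting one of the changes:
zfs inherit recordsize mail_dataset/cyrus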

--=_be6199b7d5e868a9b11faca98e7631c5-- From nobody Wed Apr 6 20:43:49 2022 X-Original-To: freebsd-performance@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 612D21A94AD9; Wed, 6 Apr 2022 20:44:01 +0000 (UTC) (envelope-from mike@sentex.net) Received: from smarthost1.sentex.ca (smarthost1.sentex.ca [IPv6:2607:f3e0:0:1::12]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "smarthost1.sentex.ca", Issuer "R3" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4KYc1X2RkDz4gKb; Wed, 6 Apr 2022 20:44:00 +0000 (UTC) (envelope-from mike@sentex.net) Received: from pyroxene2a.sentex.ca (pyroxene19.sentex.ca [199.212.134.19]) by smarthost1.sentex.ca (8.16.1/8.16.1) with ESMTPS id 236KhqDK072088 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=FAIL); Wed, 6 Apr 2022 16:43:52 -0400 (EDT) (envelope-from mike@sentex.net) Received: from [IPV6:2607:f3e0:0:4:434:73cd:9d42:28ad] ([IPv6:2607:f3e0:0:4:434:73cd:9d42:28ad]) by pyroxene2a.sentex.ca (8.16.1/8.15.2) with ESMTPS id 236Khnm3007273 (version=TLSv1.3 cipher=TLS_AES_128_GCM_SHA256 bits=128 verify=NO); Wed, 6 Apr 2022 16:43:49 -0400 (EDT) (envelope-from mike@sentex.net) Message-ID: Date: Wed, 6 Apr 2022 16:43:49 -0400 List-Id: Performance/tuning List-Archive: https://lists.freebsd.org/archives/freebsd-performance List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-performance@freebsd.org X-BeenThere: freebsd-performance@freebsd.org MIME-Version: 1.0 User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Thunderbird/91.7.0 Subject: Re: {* 05.00 *}Re: Desperate with 870 QVO and ZFS Content-Language: en-US To: Bob Friesenhahn , egoitz@ramattack.net Cc: freebsd-fs@FreeBSD.org, freebsd-hackers@FreeBSD.org, Freebsd performance References: <4e98275152e23141eae40dbe7ba5571f@ramattack.net> <665236B1-8F61-4B0E-BD9B-7B501B8BD617@ultra-secure.de> <0ef282aee34b441f1991334e2edbcaec@ramattack.net> <28e11d7ec0ac5dbea45f9f271fc28f06@ramattack.net> <7aa95cb4bf1fd38b3fce93bc26826042@ramattack.net> From: mike tancsa In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Scanned-By: MIMEDefang 2.84 X-Rspamd-Queue-Id: 4KYc1X2RkDz4gKb X-Spamd-Bar: --- Authentication-Results: mx1.freebsd.org; dkim=none; dmarc=none; spf=pass (mx1.freebsd.org: domain of mike@sentex.net designates 2607:f3e0:0:1::12 as permitted sender) smtp.mailfrom=mike@sentex.net X-Spamd-Result: default: False [-3.18 / 15.00]; ARC_NA(0.00)[]; NEURAL_HAM_MEDIUM(-1.00)[-0.999]; FREEFALL_USER(0.00)[mike]; FROM_HAS_DN(0.00)[]; TO_DN_SOME(0.00)[]; R_SPF_ALLOW(-0.20)[+ip6:2607:f3e0::/32]; NEURAL_HAM_LONG(-1.00)[-1.000]; MIME_GOOD(-0.10)[text/plain]; DMARC_NA(0.00)[sentex.net]; RCPT_COUNT_FIVE(0.00)[5]; RCVD_COUNT_THREE(0.00)[3]; TO_MATCH_ENVRCPT_SOME(0.00)[]; NEURAL_HAM_SHORT(-0.78)[-0.782]; MLMMJ_DEST(0.00)[freebsd-fs,freebsd-hackers,freebsd-performance]; FROM_EQ_ENVFROM(0.00)[]; R_DKIM_NA(0.00)[]; MIME_TRACE(0.00)[0:+]; ASN(0.00)[asn:11647, ipnet:2607:f3e0::/32, country:CA]; RCVD_TLS_ALL(0.00)[]; MID_RHS_MATCH_FROM(0.00)[]; RCVD_IN_DNSWL_LOW(-0.10)[199.212.134.19:received] X-ThisMailContainsUnwantedMimeParts: N On 4/6/2022 4:18 PM, Bob Friesenhahn wrote: > On Wed, 6 Apr 2022, egoitz@ramattack.net wrote: >>> >>> WE DON'T USE COMPRESSION AS IT'S NOT SET BY DEFAULT. SOME PEOPLE SAY >>> YOU SHOULD HAVE IT ENABLED.... BUT.... 
JUST FOR AVOID HAVING SOME >>> DATA COMPRESSED SOME OTHER NOT (IN CASE YOU ENABLE AND LATER >>> DISABLE) AND FINALLY FOR AVOID ACCESSING TO INFORMATION WITH >>> DIFFERENT CPU COSTS OF HANDLING... WE HAVE NOT TOUCHED COMPRESSION.... > > There seems to be a problem with your caps-lock key. > > Since it seems that you said that you are using maildir for your mail > server, it is likely very useful if you do enable even rather mild > compression (e.g. lz4) since this will reduce the write work-load and > even short files will be stored more efficiently. > FYI, a couple of our big zfs  mailspools sees a 1.24x and 1.23x compress ratio with lz4.  We use Maildir format as well.  They are not RELENG_13 so not sure how zstd would fair.     ---Mike From nobody Wed Apr 6 21:06:15 2022 X-Original-To: performance@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 0BE611A9C465 for ; Wed, 6 Apr 2022 21:06:26 +0000 (UTC) (envelope-from crest@rlwinm.de) Received: from mail.rlwinm.de (mail.rlwinm.de [138.201.35.217]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 4KYcWP0nV9z4n1p for ; Wed, 6 Apr 2022 21:06:25 +0000 (UTC) (envelope-from crest@rlwinm.de) Received: from [IPV6:2001:16b8:6410:e900:8468:f98d:8c6b:de2c] (200116b86410e9008468f98d8c6bde2c.dip.versatel-1u1.de [IPv6:2001:16b8:6410:e900:8468:f98d:8c6b:de2c]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange ECDHE (P-384) server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) by mail.rlwinm.de (Postfix) with ESMTPSA id 092A92492B for ; Wed, 6 Apr 2022 21:06:17 +0000 (UTC) Content-Type: multipart/alternative; boundary="------------O9TimV1H5koHqUIsOKaj4g23" Message-ID: <803f008d-b91a-2a8d-88f9-3d2d091149df@rlwinm.de> Date: Wed, 6 Apr 2022 23:06:15 +0200 List-Id: Performance/tuning List-Archive: https://lists.freebsd.org/archives/freebsd-performance List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-performance@freebsd.org X-BeenThere: freebsd-performance@freebsd.org MIME-Version: 1.0 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:91.0) Gecko/20100101 Thunderbird/91.7.0 Subject: Re: {* 05.00 *}Re: Desperate with 870 QVO and ZFS Content-Language: en-US To: performance@freebsd.org References: <4e98275152e23141eae40dbe7ba5571f@ramattack.net> <665236B1-8F61-4B0E-BD9B-7B501B8BD617@ultra-secure.de> <0ef282aee34b441f1991334e2edbcaec@ramattack.net> From: Jan Bramkamp In-Reply-To: X-Rspamd-Queue-Id: 4KYcWP0nV9z4n1p X-Spamd-Bar: -- Authentication-Results: mx1.freebsd.org; dkim=none; dmarc=none; spf=pass (mx1.freebsd.org: domain of crest@rlwinm.de designates 138.201.35.217 as permitted sender) smtp.mailfrom=crest@rlwinm.de X-Spamd-Result: default: False [-2.91 / 15.00]; RCVD_VIA_SMTP_AUTH(0.00)[]; ARC_NA(0.00)[]; MID_RHS_MATCH_FROM(0.00)[]; FROM_HAS_DN(0.00)[]; TO_MATCH_ENVRCPT_ALL(0.00)[]; R_SPF_ALLOW(-0.20)[+mx]; MIME_GOOD(-0.10)[multipart/alternative,text/plain]; TO_DN_NONE(0.00)[]; PREVIOUSLY_DELIVERED(0.00)[performance@freebsd.org]; RCPT_COUNT_ONE(0.00)[1]; NEURAL_HAM_LONG(-1.00)[-1.000]; DMARC_NA(0.00)[rlwinm.de]; NEURAL_HAM_SHORT(-0.61)[-0.612]; NEURAL_HAM_MEDIUM(-1.00)[-0.999]; MLMMJ_DEST(0.00)[performance]; FROM_EQ_ENVFROM(0.00)[]; R_DKIM_NA(0.00)[]; MIME_TRACE(0.00)[0:+,1:+,2:~]; 
ASN(0.00)[asn:24940, ipnet:138.201.0.0/16, country:DE]; RCVD_COUNT_TWO(0.00)[2]; RCVD_TLS_ALL(0.00)[]; RECEIVED_SPAMHAUS_PBL(0.00)[2001:16b8:6410:e900:8468:f98d:8c6b:de2c:received] X-ThisMailContainsUnwantedMimeParts: N This is a multi-part message in MIME format. --------------O9TimV1H5koHqUIsOKaj4g23 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit On 06.04.22 18:34, egoitz@ramattack.net wrote: > > Hi Stefan! > > > Thank you so much for your answer!!. I do answer below in green bold > for instance... for a better distinction.... > > > Very thankful for all your comments Stefan!!! :) :) :) > > > Cheers!! > > > El 2022-04-06 17:43, Stefan Esser escribió: > >> ATENCION >> ATENCION >> ATENCION!!! Este correo se ha enviado desde fuera de la organizacion. >> No pinche en los enlaces ni abra los adjuntos a no ser que reconozca >> el remitente y sepa que el contenido es seguro. >> >> Am 06.04.22 um 16:36 schrieb egoitz@ramattack.net: >>> Hi Rainer! >>> >>> Thank you so much for your help :) :) >>> >>> Well I assume they are in a datacenter and should not be a power >>> outage.... >>> >>> About dataset size... yes... our ones are big... they can be 3-4 TB >>> easily each >>> dataset..... >>> >>> We bought them, because as they are for mailboxes and mailboxes grow and >>> grow.... for having space for hosting them... >> >> Which mailbox format (e.g. mbox, maildir, ...) do you use? >> *I'm running Cyrus imap so sort of Maildir... too many little files >> normally..... Sometimes directories with tons of little files....* >> >>> We knew they had some speed issues, but those speed issues, we >>> thought (as >>> Samsung explains in the QVO site) they started after exceeding the >>> speeding >>> buffer this disks have. We though that meanwhile you didn't exceed it's >>> capacity (the capacity of the speeding buffer) no speed problem >>> arises. Perhaps >>> we were wrong?. >> >> These drives are meant for small loads in a typical PC use case, >> i.e. some installations of software in the few GB range, else only >> files of a few MB being written, perhaps an import of media files >> that range from tens to a few hundred MB at a time, but less often >> than once a day. >> *We move, you know... lots of little files... and lot's of different >> concurrent modifications by 1500-2000 concurrent imap connections we >> have...* >> >> As the SSD fills, the space available for the single level write >> cache gets smaller >> *The single level write cache is the cache these ssd drivers have, >> for compensating the speed issues they have due to using qlc memory?. >> Do you refer to that?. Sorry I don't understand well this paragraph.* A single flash cell can be thought of as a software adjustable resistor as part of a voltage divider with a fixed resistor. Storing just a single bit per flash cell allows very fast writes and long lifetimes for each flash cell at the cost of low data density. You cheaped out and bough the crappiest type of consumer SSDs. These SSDs are optimized for one thing: price per capacity (at reasonable read performance). They accomplish this by exploiting the expected user behavior of modifying only small subsets of the stored data in short bursts and buying (a lot more capacity) than they use. You deployed them in a mail server facing at least continuous writes for hours on end most days of the week. As average load increases and the cheap SSDs fill up less and less unallocated flash can be used to cache and the fast SLC cache fills up. 
The SSD firmware now has to stop accepting new requests from the SATA port and because only ~30 operations can be queued per SATA disk and the ordering requirements between those operations not even reads can be satisfied while the cache gets slowly written out storing four bits per flash cell instead of one. To the user this appears as the system almost hanging because every uncached read and sync write takes tens to 100s of milliseconds instead of less than 3ms. No amount of file system or driver tuning can truly fix this design flaw/compromise without severely limiting the write throughput in software to stay below the sustained drain rate of the SLC cache. If you want to invest time, pain and suffering to squish the most out of this hardware look into the ~2015 CAM I/O scheduler work Netflix upstreamed back to FreeBSD. Enabling this requires at least building and installing your own kernel with this feature enabled, setting acceptable latency targets and defining the read/write mix the scheduler should maintain. I don't expect you'll get satisfactory results out of those disks even with lots of experimentation. If you want to experiment with I/O scheduling on cheap SSDs start by *migrating all production workloads* out of your lab environment. The only safe and quick way out of this mess is for you to replace all QVO SSDs with at least as large SSDs designed for sustained write workloads. --------------O9TimV1H5koHqUIsOKaj4g23 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: 8bit
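For completeness, the CAM I/O scheduler mentioned above needs a custom kernel. The sketch below shows the general shape only: the option name is CAM_IOSCHED_DYNAMIC, but the exact tuning knobs that appear under the iosched sysctl node (and whether the disks attach as ada or da) vary between FreeBSD releases, so treat this as a starting point rather than a recipe.

# /usr/src/sys/amd64/conf/IOSCHED -- minimal custom kernel configuration
include GENERIC
ident   IOSCHED
options CAM_IOSCHED_DYNAMIC

# Build, install and boot it:
cd /usr/src && make -j8 KERNCONF=IOSCHED buildkernel installkernel

# Per-disk scheduler settings should then appear under the CAM periphs, e.g.:
sysctl kern.cam.ada.0.iosched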


--------------O9TimV1H5koHqUIsOKaj4g23-- From nobody Wed Apr 6 21:19:22 2022 X-Original-To: performance@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id DA72D1A9F4B2 for ; Wed, 6 Apr 2022 21:19:33 +0000 (UTC) (envelope-from crest@rlwinm.de) Received: from mail.rlwinm.de (mail.rlwinm.de [IPv6:2a01:4f8:171:f902::5]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 4KYcpX42LGz4pkd for ; Wed, 6 Apr 2022 21:19:32 +0000 (UTC) (envelope-from crest@rlwinm.de) Received: from [IPV6:2001:16b8:6410:e900:8468:f98d:8c6b:de2c] (200116b86410e9008468f98d8c6bde2c.dip.versatel-1u1.de [IPv6:2001:16b8:6410:e900:8468:f98d:8c6b:de2c]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange ECDHE (P-384) server-signature RSA-PSS (4096 bits)) (No client certificate requested) by mail.rlwinm.de (Postfix) with ESMTPSA id 564CB2492C for ; Wed, 6 Apr 2022 21:19:23 +0000 (UTC) Content-Type: multipart/alternative; boundary="------------G4wltCyoa74m5jS4qYaJ2Oo0" Message-ID: Date: Wed, 6 Apr 2022 23:19:22 +0200 List-Id: Performance/tuning List-Archive: https://lists.freebsd.org/archives/freebsd-performance List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-performance@freebsd.org X-BeenThere: freebsd-performance@freebsd.org MIME-Version: 1.0 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:91.0) Gecko/20100101 Thunderbird/91.7.0 Subject: Re: {* 05.00 *}Re: Desperate with 870 QVO and ZFS Content-Language: en-US To: performance@freebsd.org References: <4e98275152e23141eae40dbe7ba5571f@ramattack.net> <665236B1-8F61-4B0E-BD9B-7B501B8BD617@ultra-secure.de> <0ef282aee34b441f1991334e2edbcaec@ramattack.net> <28e11d7ec0ac5dbea45f9f271fc28f06@ramattack.net> <7aa95cb4bf1fd38b3fce93bc26826042@ramattack.net> From: Jan Bramkamp In-Reply-To: X-Rspamd-Queue-Id: 4KYcpX42LGz4pkd X-Spamd-Bar: --- Authentication-Results: mx1.freebsd.org; dkim=none; dmarc=none; spf=pass (mx1.freebsd.org: domain of crest@rlwinm.de designates 2a01:4f8:171:f902::5 as permitted sender) smtp.mailfrom=crest@rlwinm.de X-Spamd-Result: default: False [-3.26 / 15.00]; ARC_NA(0.00)[]; RCVD_VIA_SMTP_AUTH(0.00)[]; MID_RHS_MATCH_FROM(0.00)[]; FROM_HAS_DN(0.00)[]; R_SPF_ALLOW(-0.20)[+mx:c]; TO_MATCH_ENVRCPT_ALL(0.00)[]; MIME_GOOD(-0.10)[multipart/alternative,text/plain]; PREVIOUSLY_DELIVERED(0.00)[performance@freebsd.org]; TO_DN_NONE(0.00)[]; RCPT_COUNT_ONE(0.00)[1]; NEURAL_HAM_LONG(-1.00)[-1.000]; DMARC_NA(0.00)[rlwinm.de]; NEURAL_HAM_SHORT(-0.96)[-0.963]; NEURAL_HAM_MEDIUM(-1.00)[-0.999]; MLMMJ_DEST(0.00)[performance]; FROM_EQ_ENVFROM(0.00)[]; R_DKIM_NA(0.00)[]; MIME_TRACE(0.00)[0:+,1:+,2:~]; ASN(0.00)[asn:24940, ipnet:2a01:4f8::/32, country:DE]; RCVD_COUNT_TWO(0.00)[2]; RCVD_TLS_ALL(0.00)[]; RECEIVED_SPAMHAUS_PBL(0.00)[2001:16b8:6410:e900:8468:f98d:8c6b:de2c:received] X-ThisMailContainsUnwantedMimeParts: N This is a multi-part message in MIME format. --------------G4wltCyoa74m5jS4qYaJ2Oo0 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit On 06.04.22 22:43, mike tancsa wrote: > On 4/6/2022 4:18 PM, Bob Friesenhahn wrote: >> On Wed, 6 Apr 2022, egoitz@ramattack.net wrote: >>>> >>>> WE DON'T USE COMPRESSION AS IT'S NOT SET BY DEFAULT. SOME PEOPLE >>>> SAY YOU SHOULD HAVE IT ENABLED.... BUT.... 
JUST FOR AVOID HAVING >>>> SOME DATA COMPRESSED SOME OTHER NOT (IN CASE YOU ENABLE AND LATER >>>> DISABLE) AND FINALLY FOR AVOID ACCESSING TO INFORMATION WITH >>>> DIFFERENT CPU COSTS OF HANDLING... WE HAVE NOT TOUCHED COMPRESSION.... >> >> There seems to be a problem with your caps-lock key. >> >> Since it seems that you said that you are using maildir for your mail >> server, it is likely very useful if you do enable even rather mild >> compression (e.g. lz4) since this will reduce the write work-load and >> even short files will be stored more efficiently. >> > FYI, a couple of our big zfs  mailspools sees a 1.24x and 1.23x > compress ratio with lz4.  We use Maildir format as well.  They are not > RELENG_13 so not sure how zstd would fair. I've found that Dovecot's mdbox format compresses a lot better than Maildir (or sdbox), because it stores multiple messages per file resulting in files large enough to contain enough exploitable reduncancy to compress down to the next smaller blocksize. In a corporate or education environment where users tend to send the same medium to large attachments multiple times to multiple recipients on the same server Dovecot's single instance storage is a game changer. It reduced my IMAP storage requirements by a *factor* of 4.7 which allowed me to get rid of spinning disks for the mail servers instead of playing losing games with hybrid storage. Dovecot also supports zlib compression in the application instead of punting it to the file system. I don't know if Cyrus IMAP offers similar features, but if it does I would recommend evaluating them instead of compressing or deduplicating at the file system level. --------------G4wltCyoa74m5jS4qYaJ2Oo0 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: 8bit
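Before choosing between application-level (Dovecot/Cyrus) compression and ZFS compression it is worth measuring what the file system would actually achieve on the existing mail spool. The property names below are standard ZFS ones; only the dataset name is an example.

zfs get compression,compressratio,used,logicalused mail_dataset

# compressratio only covers data written after compression was enabled, so
# after switching lz4 on let normal traffic run for a few days, then compare
# logicalused (uncompressed size) with used (on-disk size).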
--------------G4wltCyoa74m5jS4qYaJ2Oo0-- From nobody Wed Apr 6 21:49:15 2022 X-Original-To: freebsd-performance@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 4B1871AA6E8B; Wed, 6 Apr 2022 21:49:21 +0000 (UTC) (envelope-from se@FreeBSD.org) Received: from smtp.freebsd.org (smtp.freebsd.org [96.47.72.83]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256 client-signature RSA-PSS (4096 bits) client-digest SHA256) (Client CN "smtp.freebsd.org", Issuer "R3" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4KYdSx1J3Rz3BqT; Wed, 6 Apr 2022 21:49:21 +0000 (UTC) (envelope-from se@FreeBSD.org) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=freebsd.org; s=dkim; t=1649281761; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=NTQDA7AmEiqtZXUltjZC6r1zqKWrWLTuqyQu/QurDx8=; b=gb/sw9et6KM9TPLK36zERnOOE6ihAh7FHW0Z5Fgu10FhytX/Y4J4Pnd4u2AlvMSr8DL7Kg iRDaVGq2tiA5Yz7RzmwkeTQxw3flBHgJmswMtg8EOC1gl/L6SIqgBKMVJ4ICBWBxc95STj fSxZCei8tWY+Pwm4+qpUrzivT6+QMaMMm2ig9Wlk6oFlv+yEKwmxiHaqhQpVVpPCqystKX 1MPoTCNvMJgUU7gyYkTGWICu5UvV4fBq9gtF1ErEGXDeqCQ1w3ENkbV79lPsZ6knddmHRz 6YK0Vv0o6WEPzjboA8bpJY6rC5cS6KIfq81o7l4IM/VRAkx566gjymcWXm8a9g== Received: from [IPV6:2003:cd:5f22:6f00:953e:7ee1:500e:87a1] (p200300cd5f226f00953e7ee1500e87a1.dip0.t-ipconnect.de [IPv6:2003:cd:5f22:6f00:953e:7ee1:500e:87a1]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client did not present a certificate) (Authenticated sender: se/mail) by smtp.freebsd.org (Postfix) with ESMTPSA id 176914D30; Wed, 6 Apr 2022 21:49:19 +0000 (UTC) (envelope-from se@FreeBSD.org) Message-ID: Date: Wed, 6 Apr 2022 23:49:15 +0200 List-Id: Performance/tuning List-Archive: https://lists.freebsd.org/archives/freebsd-performance List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-performance@freebsd.org X-BeenThere: freebsd-performance@freebsd.org MIME-Version: 1.0 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:91.0) Gecko/20100101 Thunderbird/91.7.0 Subject: Re: {* 05.00 *}Re: Desperate with 870 QVO and ZFS Content-Language: en-US To: egoitz@ramattack.net Cc: freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org, freebsd-performance@freebsd.org, Rainer Duffner References: <4e98275152e23141eae40dbe7ba5571f@ramattack.net> <665236B1-8F61-4B0E-BD9B-7B501B8BD617@ultra-secure.de> <0ef282aee34b441f1991334e2edbcaec@ramattack.net> From: Stefan Esser In-Reply-To: Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="------------Lic1usorjc8S7FifC6L0nUmq" ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=freebsd.org; s=dkim; t=1649281761; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=NTQDA7AmEiqtZXUltjZC6r1zqKWrWLTuqyQu/QurDx8=; b=v5/9q0D36uM6by3oYOW443+rDA3y21ugXHLeseeuzJs6uALdA6p7lQNGhL/K5PLN0SncDG Ze2s0aAILCb2FaUJDRp5csshI8Bd9nT+L5jXZJrOKJ7L4RHetAoNVXCj2/Onql93fRJft7 0IIaiCTwWEkQWgpWvpqm8VhLvNkr1LlxW5Ml6Oh7uxJgefQH/r0Y5jKry+tmSeRtIvNAdc 2kaZJrECFKqL+zOJeb5qObc1ss0F4ViPi/wrPL3DXSGn6Cc2J/hdU6A18B4fGGMJEtVOeJ +Ln/wIfCbLW+CMB3b86IlBeIF6YaestXttvguHQa68/IgnQl+mCKQ9TkFYvjqw== ARC-Seal: i=1; 
s=dkim; d=freebsd.org; t=1649281761; a=rsa-sha256; cv=none; b=KT9L8BUHeIW4nZrivqVSEpnzwZ2veeKiP2ovXRvBoh2v+loi8Ts72fDcwkndPNSnqlNzPQ CD1E9/NP2epXmB33xK+oL9fJN1k0ZbmLf9uC8jasQwU77EQM7jKDKbuLu1rSHhNC2Avw3r Q+0lAxAyhzDjMAoF4MoAND7D7RAMkOLMeRvKMjRnYKV4FMCLtN+LTtXs2bLcr+j+hQ74Rv D1mKpI60anTgLJDuSxe0ScgthHQ1MpvgjLUTRFPuB/g1Bz10QS/jTP/IAWKkWua97/6hQV PCaznQuyUuI+N+IFFpJ2JhW1zuQfJUgJMP3Vt4tzVxYuM1WPzVcSeYA0EMUtRA== ARC-Authentication-Results: i=1; mx1.freebsd.org; none X-ThisMailContainsUnwantedMimeParts: N This is an OpenPGP/MIME signed message (RFC 4880 and 3156) --------------Lic1usorjc8S7FifC6L0nUmq Content-Type: multipart/mixed; boundary="------------OGljiRSjG08yilMHhaNFyDtW"; protected-headers="v1" From: Stefan Esser To: egoitz@ramattack.net Cc: freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org, freebsd-performance@freebsd.org, Rainer Duffner Message-ID: Subject: Re: {* 05.00 *}Re: Desperate with 870 QVO and ZFS References: <4e98275152e23141eae40dbe7ba5571f@ramattack.net> <665236B1-8F61-4B0E-BD9B-7B501B8BD617@ultra-secure.de> <0ef282aee34b441f1991334e2edbcaec@ramattack.net> In-Reply-To: --------------OGljiRSjG08yilMHhaNFyDtW Content-Type: multipart/alternative; boundary="------------ASZf2V7VPlu8lceG03SodTv5" --------------ASZf2V7VPlu8lceG03SodTv5 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Am 06.04.22 um 18:34 schrieb egoitz@ramattack.net: > Hi Stefan! > > Thank you so much for your answer!!. I do answer below in green bold fo= r > instance... for a better distinction.... > > Very thankful for all your comments Stefan!!! :) :) :) > > Cheers!! > Hi, glad to hear that it is useful information - I'll add comments below ... > El 2022-04-06 17:43, Stefan Esser escribi=C3=B3: > >> ATENCION >> ATENCION >> ATENCION!!! Este correo se ha enviado desde fuera de la organizacion. = No >> pinche en los enlaces ni abra los adjuntos a no ser que reconozca el >> remitente y sepa que el contenido es seguro. >> >> Am 06.04.22 um 16:36 schrieb egoitz@ramattack.net: >>> Hi Rainer! >>> >>> Thank you so much for your help :) :) >>> >>> Well I assume they are in a datacenter and should not be a power outa= ge.... >>> >>> About dataset size... yes... our ones are big... they can be 3-4 TB e= asily each >>> dataset..... >>> >>> We bought them, because as they are for mailboxes and mailboxes grow = and >>> grow.... for having space for hosting them... >> >> Which mailbox format (e.g. mbox, maildir, ...) do you use? >> =C2=A0 >> *I'm running Cyrus imap so sort of Maildir... too many little files >> normally..... Sometimes directories with tons of little files....* Assuming that many mails are much smaller than the erase block size of th= e SSD, this may cause issues. (You may know the following ...) For example, if you have message sizes of 8 KB and an erase block size of= 64 KB (just guessing), then 8 mails will be in an erase block. If half the mail= s are deleted, then the erase block will still occupy 64 KB, but only hold 32 K= B of useful data (and the SSD will only be aware of this fact if TRIM has sign= aled which data is no longer relevant). The SSD will copy several partially fi= lled erase blocks together in a smaller number of free blocks, which then are = fully utilized. Later deletions will repeat this game, and your data will be co= pied multiple times until it has aged (and the user is less likely to delete f= urther messages). This leads to "write amplification" - data is internally moved= around and thus written multiple times. 
Larger mails are less of an issue since they span multiple erase blocks, = which will be completely freed when such a message is deleted. Samsung has a lot of experience and generally good strategies to deal wit= h such a situation, but SSDs specified for use in storage systems might be much = better suited for that kind of usage profile. >>> We knew they had some speed issues, but those speed issues, we though= t (as >>> Samsung explains in the QVO site) they started after exceeding the sp= eeding >>> buffer this disks have. We though that meanwhile you didn't exceed it= 's >>> capacity (the capacity of the speeding buffer) no speed problem arise= s. Perhaps >>> we were wrong?. >> >> These drives are meant for small loads in a typical PC use case, >> i.e. some installations of software in the few GB range, else only >> files of a few MB being written, perhaps an import of media files >> that range from tens to a few hundred MB at a time, but less often >> than once a day. >> =C2=A0 >> *We move, you know... lots of little files... and lot's of different >> concurrent modifications by 1500-2000 concurrent imap connections we h= ave...* I do not expect the read load to be a problem (except possibly when the S= SD is moving data from SLC to QLC blocks, but even then reads will get priority= ). But writes and trims might very well overwhelm the SSD, especially when its g= etting full. Keeping a part of the SSD unused (excluded from the partitions crea= ted) will lead to a large pool of unused blocks. This will reduce the write amplification - there are many free blocks in the "unpartitioned part" of= the SSD, and thus there is less urgency to compact partially filled blocks. (= E.g. if you include only 3/4 of the SSD capacity in a partition used for the Z= POOL, then 1/4 of each erase block could be free due to deletions/TRIM without = any compactions required to hold all this data.) Keeping a significant percentage of the SSD unallocated is a good strateg= y to improve its performance and resilience. >> As the SSD fills, the space available for the single level write >> cache gets smaller >> =C2=A0 >> *The single level write cache is the cache these ssd drivers have, for= >> compensating the speed issues they have due to using qlc memory?. Do y= ou >> refer to that?. Sorry I don't understand well this paragraph.* Yes, the SSD is specified to hold e.g. 1 TB at 4 bits per cell. The SLC c= ache has only 1 bit per cell, thus a 6 GB SLC cache needs as many cells as 24 = GB of data in QLC mode. A 100 GB SLC cache would reduce the capacity of a 1 TB SSD to 700 GB (600= GB in 150 tn QLC cells plus 100 GB in 100 tn SLC cells). Therefore, the fraction of the cells used as an SLC cache is reduced when= it gets full (e.g. ~1 TB in ~250 tn QLC cells, plus 6 GB in 6 tn SLC cells).= And with less SLC cells available for short term storage of data the probability of data being copied to QLC cells before the irrelevant messa= ges have been deleted is significantly increased. And that will again lead to= many more blocks with "holes" (deleted messages) in them, which then need to b= e copied possibly multiple times to compact them. >> (on many SSDs, I have no numbers for this >> particular device), and thus the amount of data that can be >> written at single cell speed shrinks as the SSD gets full. >> =C2=A0 >> >> >> I have just looked up the size of the SLC cache, it is specified >> to be 78 GB for the empty SSD, 6 GB when it is full (for the 2 TB >> version, smaller models will have a smaller SLC cache). 
>> =C2=A0 >> *Assuming you were talking about the cache for compensating speed we >> previously commented, I should say these are the 870 QVO but the 8TB >> version. So they should have the biggest cache for compensating the sp= eed >> issues...* I have looked up the data: the larger versions of the 870 QVO have the sa= me SLC cache configuration as the 2 TB model, 6 GB minimum and up to 72 GB more = if there are enough free blocks. >> But after writing those few GB at a speed of some 500 MB/s (i.e. >> after 12 to 150 seconds), the drive will need several minutes to >> transfer those writes to the quad-level cells, and will operate >> at a fraction of the nominal performance during that time. >> (QLC writes max out at 80 MB/s for the 1 TB model, 160 MB/s for the >> 2 TB model.) >> =C2=A0 >> *Well we are in the 8TB model. I think I have understood what you wrot= e in >> previous paragraph. You said they can be fast but not constantly, beca= use >> later they have to write all that to their perpetual storage from the = cache. >> And that's slow. Am I wrong?. Even in the 8TB model you think Stefan?.= * The controller in the SSD supports a given number of channels (e.g 4), ea= ch of which can access a Flash chip independently of the others. Small SSDs oft= en have less Flash chips than there are channels (and thus a lower throughpu= t, especially for writes), but the larger models often have more chips than channels and thus the performance is capped. In the case of the 870 QVO, the controller supports 8 channels, which all= ows it to write 160 MB/s into the QLC cells. The 1 TB model apparently has only = 4 Flash chips and is thus limited to 80 MB/s in that situation, while the l= arger versions have 8, 16, or 32 chips. But due to the limited number of channe= ls, the write rate is limited to 160 MB/s even for the 8 TB model. If you had 4 * 2 TB instead, the throughput would be 4 * 160 MB/s in this= limit. >> *The main problem we are facing is that in some peak moments, when the= >> machine serves connections for all the instances it has, and only as s= aid in >> some peak moments... like the 09am or the 11am.... it seems the machin= e >> becomes slower... and like if the disks weren't able to serve all they= have >> to serve.... In these moments, no big files are moved... but as we hav= e >> 1800-2000 concurrent imap connections... normally they are doing each = one... >> little changes in their mailbox. Do you think perhaps this disks then = are >> not appropriate for this kind of usage?-* I'd guess that the drives get into a state in which they have to recycle = lots of partially free blocks (i.e. perform kind of a garbage collection) and = then three kinds of operations are competing with each other: 1. reads (generally prioritized) 2. writes (filling the SLC cache up to its maximum size) 3. compactions of partially filled blocks (required to make free blocks available for re-use) Writes can only proceed if there are sufficient free blocks, which on a f= illed SSD with partially filled erase blocks means that operations of type 3. n= eed to be performed with priority to not stall all writes. My assumption is that this is what you are observing under peak load. >> And cheap SSDs often have no RAM cache (not checked, but I'd be >> surprised if the QVO had one) and thus cannot keep bookkeeping date >> in such a cache, further limiting the performance under load. 
>> =C2=A0 >> *This brochure >> (https://semiconductor.samsung.com/resources/brochure/870_Series_Broch= ure.pdf >> and the datasheet >> https://semiconductor.samsung.com/resources/data-sheet/Samsung_SSD_870= _QVO_Data_Sheet_Rev1.1.pdf) >> sais if I have read properly, the 8TB drive has 8GB of ram?. I assume = that >> is what they call the turbo write cache?.* No, the turbo write cache consists of the cells used in SLC mode (which c= an be any cells, not only cells in a specific area of the flash chip). The RAM is needed for fast lookup of the position of data for reads and o= f free blocks for writes. There is no simple relation between SSD "block number" (in the sense of a= disk block on some track of a magnetic disk) and its storage location on the F= lash chip. If an existing "data block" (what would be a sector on a hard disk = drive) is overwritten, it is instead written at the end of an "open" erase block= , and a pointer from that "block number" to the location on the chip is stored = in an index. This index is written to Flash storage and could be read from it, = but it is much faster to have a RAM with these pointers that can be accessed independently of the Flash chips. This RAM is required for high transacti= on rates (especially random reads), but it does not really help speed up wri= tes. >> And the resilience (max. amount of data written over its lifetime) >> is also quite low - I hope those drives are used in some kind of >> RAID configuration. >> =C2=A0 >> *Yep we use raidz-2* Makes sense ... But you know that you multiply the amount of data written= due to the redundancy. If a single 8 KB block is written, for example, 3 * 8 KB will written if = you take the 2 redundant copies into account. >> The 870 QVO is specified for 370 full capacity >> writes, i.e. 370 TB for the 1 TB model. That's still a few hundred >> GB a day - but only if the write amplification stays in a reasonable >> range ... >> =C2=A0 >> *Well yes... 2880TB in our case....not bad.. isn't it?* I assume that 2880 TB is your total storage capacity? That's not too bad,= in fact. ;-) This would be 360 * 8 TB ... Even at 160 MB/s per 8 TB SSD this would allow for more than 50 GB/s of w= rite throughput (if all writes were evenly distributed). Taking all odds into account, I'd guess that at least 10 GB/s can be continuously written (if supported by the CPUs and controllers). But this may not be true if the drive is simultaneously reading, trimming= , and writing ... I have seen advice to not use compression in a high load scenario in some= other reply. I tend to disagree: Since you seem to be limited when the SLC cache is exhausted, you should get better performance if you compress your data. I= have found that zstd-2 works well for me (giving a significant overall reducti= on of size at reasonable additional CPU load). Since ZFS allows to switch compressions algorithms at any time, you can experiment with different algorithms and levels. One advantage of ZFS compression is that it applies to the ARC, too. And = a compression factor of 2 should easily be achieved when storing mail (not = for =2Edocx, .pdf, .jpg files though). Having more data in the ARC will reduc= e the read pressure on the SSDs and will give them more cycles for garbage collections (which are performed in the background and required to always= have a sufficient reserve of free flash blocks for writes). I'd give it a try - and if it reduces your storage requirements by 10% on= ly, then keep 10% of each SSD unused (not assigned to any partition). 
That wi= ll greatly improve the resilience of your SSDs, reduce the write-amplificati= on, will allow the SLC cache to stay at its large value, and may make a large= difference to the effective performance under high load. Regards, STefan ** --------------ASZf2V7VPlu8lceG03SodTv5 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable

Am 06.04.22 um 18:34 schrieb egoitz@ramattack.net:

Hi Stefan!

Thank you so much for your answer!! I answer below (originally in green bold) for a better distinction....

Very thankful for all your comments Stefan!!! :) :) :)

Cheers!!

Hi,

glad to hear that it is useful information - I'll add comments below ...

El 2022-04-06 17:43, Stefan Esser escribió:


Am 06.04.22 um 16:36 schrieb egoitz@ramattack.net:
Hi Rainer!

Thank you so much for your help :) :)

Well, I assume they are in a datacenter and there should not be a power outage....

About dataset size... yes... our ones are big... they can be 3-4 TB easily each
dataset.....

We bought them, because as they are for mailboxes and mailboxes grow and
grow.... for having space for hosting them...

Which mailbox format (e.g. mbox, maildir, ...) do you use?
I'm running Cyrus imap so sort of Maildir... too many little files normally..... Sometimes directories with tons of little files....

Assuming that many mails are much smaller than the erase block size of the SSD, this may cause issues. (You may know the following ...)

For example, if you have message sizes of 8 KB and an erase block size of 64 KB (just guessing), then 8 mails will be in an erase block. If half the mails are deleted, then the erase block will still occupy 64 KB, but only hold 32 KB of useful data (and the SSD will only be aware of this fact if TRIM has signaled which data is no longer relevant). The SSD will copy several partially filled erase blocks together in a smaller number of free blocks, which then are fully utilized. Later deletions will repeat this game, and your data will be copied multiple times until it has aged (and the user is less likely to delete further messages). This leads to "write amplification" - data is internally moved around and thus written multiple times.

Larger mails are less of an issue since they span multiple erase blocks, which will be completely freed when such a message is deleted.

Samsung has a lot of experience and generally good strategies to deal with such a situation, but SSDs specified for use in storage systems might be much better suited for that kind of usage profile.
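To put numbers on that write amplification, using only the guessed figures from above (they are illustrative, not measured values for the 870 QVO):

    64 KB erase block / 8 KB per mail       ->  8 mails per erase block
    half of those mails deleted             -> 32 KB of live data per block
    compacting 2 such blocks into 1         -> 64 KB copied internally just to
                                               reclaim a single erase block

So every surviving mail is written at least twice, and the factor grows with every further round of deletions before the data has "aged out".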

We knew they had some speed issues, but we thought (as Samsung explains on
the QVO site) those started only after exceeding the speed buffer these disks
have. We thought that as long as you didn't exceed its capacity (the capacity
of the speed buffer) no speed problem arises. Perhaps we were wrong?

These drives are meant for small loads in a typical PC use case,
i.e. some installations of software in the few GB range, else only
files of a few MB being written, perhaps an import of media files
that range from tens to a few hundred MB at a time, but less often
than once a day.
We move, you know... lots of little files... and lots of different concurrent modifications by the 1500-2000 concurrent IMAP connections we have...

I do not expect the read load to be a problem (except possibly when the SSD is moving data from SLC to QLC blocks, but even then reads will get priority). But writes and trims might very well overwhelm the SSD, especially when its getting full. Keeping a part of the SSD unused (excluded from the partitions created) will lead to a large pool of unused blocks. This will reduce the write amplification - there are many free blocks in the "unpartitioned part" of the SSD, and thus there is less urgency to compact partially filled blocks. (E.g. if you include only 3/4 of the SSD capacity in a partition used for the ZPOOL, then 1/4 of each erase block could be free due to deletions/TRIM without any compactions required to hold all this data.)

Keeping a significant percentage of the SSD unallocated is a good strategy to improve its performance and resilience.
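On FreeBSD that strategy is simply a matter of sizing the partition smaller than the disk; the device name and size below are placeholders for an 8 TB drive, adjust them to your layout:

# da0 is a placeholder; the space beyond the -s limit is never allocated
gpart create -s gpt da0
gpart add -t freebsd-zfs -a 1m -s 6400G -l mailssd0 da0   # leaves roughly 15% of the raw capacity untouched

If the drive has already been filled once, a whole-device TRIM or secure erase before repartitioning lets the controller actually reclaim that spare area.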

As the SSD fills, the space available for the single level write
cache gets smaller
The single level write cache is the cache these SSD drives have for compensating the speed issues they have due to using QLC memory? Is that what you refer to? Sorry, I don't understand this paragraph well.

Yes, the SSD is specified to hold e.g. 1 TB at 4 bits per cell. The SLC cache has only 1 bit per cell, thus a 6 GB SLC cache needs as many cells as 24 GB of data in QLC mode.

A 100 GB SLC cache would reduce the capacity of a 1 TB SSD to 700 GB (600 GB in 150 tn QLC cells plus 100 GB in 100 tn SLC cells).

Therefore, the fraction of the cells used as an SLC cache is reduced when it gets full (e.g. ~1 TB in ~250 tn QLC cells, plus 6 GB in 6 tn SLC cells).

And with less SLC cells available for short term storage of data the probability of data being copied to QLC cells before the irrelevant messages have been deleted is significantly increased. And that will again lead to many more blocks with "holes" (deleted messages) in them, which then need to be copied possibly multiple times to compact them.

(on many SSDs, I have no numbers for this
particular device), and thus the amount of data that can be
written at single cell speed shrinks as the SSD gets full.


I have just looked up the size of the SLC cache, it is specified
to be 78 GB for the empty SSD, 6 GB when it is full (for the 2 TB
version, smaller models will have a smaller SLC cache).
Assuming you were talking about the cache for compensating speed that we previously commented on, I should say these are the 870 QVO but the 8TB version. So they should have the biggest cache for compensating the speed issues...

I have looked up the data: the larger versions of the 870 QVO have the same SLC cache configuration as the 2 TB model, 6 GB minimum and up to 72 GB more if there are enough free blocks.

But after writing those few GB at a speed of some 500 MB/s (i.e.
after 12 to 150 seconds), the drive will need several minutes to
transfer those writes to the quad-level cells, and will operate
at a fraction of the nominal performance during that time.
(QLC writes max out at 80 MB/s for the 1 TB model, 160 MB/s for the
2 TB model.)
Well, we are on the 8TB model. I think I have understood what you wrote in the previous paragraph. You said they can be fast but not constantly, because later they have to write all that from the cache to their permanent storage. And that's slow. Am I wrong? Even in the 8TB model, you think, Stefan?

The controller in the SSD supports a given number of channels (e.g. 4), each of which can access a Flash chip independently of the others. Small SSDs often have fewer Flash chips than there are channels (and thus a lower throughput, especially for writes), while the larger models often have more chips than channels, so their performance is capped by the channel count.

In the case of the 870 QVO, the controller supports 8 channels, which allows it to write 160 MB/s into the QLC cells. The 1 TB model apparently has only 4 Flash chips and is thus limited to 80 MB/s in that situation, while the larger versions have 8, 16, or 32 chips. But due to the limited number of channels, the write rate is limited to 160 MB/s even for the 8 TB model.

If you had 4 * 2 TB instead, the throughput would be 4 * 160 MB/s in this limit.

The main problem we are facing is that in some peak moments, when the machine serves connections for all the instances it has, and only as said in some peak moments... like 09:00 or 11:00... it seems the machine becomes slower... as if the disks weren't able to serve all they have to serve.... In these moments, no big files are moved... but as we have 1800-2000 concurrent IMAP connections... normally each one is doing... little changes in its mailbox. Do you think perhaps these disks are not appropriate for this kind of usage?

I'd guess that the drives get into a state in which they have to recycle lots of partially free blocks (i.e. perform kind of a garbage collection) and then three kinds of operations are competing with each other:

  1. reads (generally prioritized)
  2. writes (filling the SLC cache up to its maximum size)
  3. compactions of partially filled blocks (required to make free blocks available for re-use)

Writes can only proceed if there are sufficient free blocks, which on a filled SSD with partially filled erase blocks means that operations of type 3. need to be performed with priority to not stall all writes.

My assumption is that this is what you are observing under peak load.
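That is easy to check the next time it happens: watch the per-disk service times with the stock FreeBSD tools (no special setup needed; the interval values are just examples):

gstat -p -I 1s    # %busy, queue length, ms/r and ms/w per physical disk
iostat -x -w 1    # extended per-device statistics, one line per second

If ms/w climbs into the hundreds while the transfer rate stays modest, the disks are busy with internal housekeeping rather than with the application's I/O.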

And cheap SSDs often have no RAM cache (not checked, but I'd be
surprised if the QVO had one) and thus cannot keep bookkeeping data
in such a cache, further limiting the performance under load.
This brochure (https://semiconductor.samsung.com/resources/brochure/870_Series_Brochure.pdf) and the datasheet (https://semiconductor.samsung.com/resources/data-sheet/Samsung_SSD_870_QVO_Data_Sheet_Rev1.1.pdf) say, if I have read properly, that the 8TB drive has 8GB of RAM? I assume that is what they call the turbo write cache?

No, the turbo write cache consists of the cells used in SLC mode (which can be any cells, not only cells in a specific area of the flash chip).

The RAM is needed for fast lookup of the position of data for reads and of free blocks for writes.

There is no simple relation between SSD "block number" (in the sense of a disk block on some track of a magnetic disk) and its storage location on the Flash chip. If an existing "data block" (what would be a sector on a hard disk drive) is overwritten, it is instead written at the end of an "open" erase block, and a pointer from that "block number" to the location on the chip is stored in an index. This index is written to Flash storage and could be read from it, but it is much faster to have a RAM with these pointers that can be accessed independently of the Flash chips. This RAM is required for high transaction rates (especially random reads), but it does not really help speed up writes.

And the resilience (max. amount of data written over its lifetime)
is also quite low - I hope those drives are used in some kind of
RAID configuration.
Yep we use raidz-2

Makes sense ... But you know that you multiply the amount of data written due to the redundancy.

If a single 8 KB block is written, for example, 3 * 8 KB will be written if you take the 2 redundant copies into account.

The 870 QVO is specified for 370 full capacity
writes, i.e. 370 TB for the 1 TB model. That's still a few hundred
GB a day - but only if the write amplification stays in a reasonable
range ...
Well yes... 2880TB in our case....not bad.. isn't it?

I assume that 2880 TB is your total storage capacity? That's not too bad, in fact. ;-)

This would be 360 * 8 TB ...

Even at 160 MB/s per 8 TB SSD this would allow for more than 50 GB/s of write throughput (if all writes were evenly distributed).

Taking all odds into account, I'd guess that at least 10 GB/s can be continuously written (if supported by the CPUs and controllers).

But this may not be true if the drive is simultaneously reading, trimming, and writing ...
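The split between reads, writes and trims can be watched from the ZFS side as well; the pool name below is the one from the zpool list output posted later in this thread, and the -l variant needs OpenZFS 2.0 (FreeBSD 13):

zpool iostat -v mail_dataset 1    # per-vdev operations and bandwidth, 1 s interval
zpool iostat -l mail_dataset 1    # average wait times per request class, where supported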


I have seen advice to not use compression in a high load scenario in some other reply.

I tend to disagree: Since you seem to be limited when the SLC cache is exhausted, you should get better performance if you compress your data. I have found that zstd-2 works well for me (giving a significant overall reduction of size at reasonable additional CPU load). Since ZFS allows switching compression algorithms at any time, you can experiment with different algorithms and levels.
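Trying that out is a one-liner and fully reversible; zstd needs OpenZFS 2.0 (FreeBSD 13), otherwise lz4 is the available choice. The dataset name is only an example:

zfs set compression=zstd-2 mail_dataset/mailboxes   # affects newly written blocks only
zfs get compression,compressratio mail_dataset/mailboxes

Existing data is not rewritten, so the compressratio only converges as new mail arrives (or after copying the data once).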

One advantage of ZFS compression is that it applies to the ARC, too. And a compression factor of 2 should easily be achieved when storing mail (not for .docx, .pdf, .jpg files though). Having more data in the ARC will reduce the read pressure on the SSDs and will give them more cycles for garbage collections (which are performed in the background and required to always have a sufficient reserve of free flash blocks for writes).

I'd give it a try - and if it reduces your storage requirements by 10% only, then keep 10% of each SSD unused (not assigned to any partition). That will greatly improve the resilience of your SSDs, reduce the write-amplification, will allow the SLC cache to stay at its large value, and may make a large difference to the effective performance under high load.
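Whether the trade-off pays for itself can be checked afterwards with the standard properties (pool name as in the zpool list output posted later; the 10% figure is just the example from above):

zfs get -r compressratio mail_dataset | head                                 # space actually saved
zpool list -o name,size,allocated,free,fragmentation,capacity mail_dataset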

Regards, STefan

--------------ASZf2V7VPlu8lceG03SodTv5-- --------------OGljiRSjG08yilMHhaNFyDtW-- --------------Lic1usorjc8S7FifC6L0nUmq Content-Type: application/pgp-signature; name="OpenPGP_signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="OpenPGP_signature" -----BEGIN PGP SIGNATURE----- wsB5BAABCAAjFiEEo3HqZZwL7MgrcVMTR+u171r99UQFAmJOCtsFAwAAAAAACgkQR+u171r99USt jAf+Jqn2i8WZUDjj7wNiYznxQzyyjhsmvUb2d7NygsZaC0lcdNuEpjkhWG+Cn7tc5mPuWkbP2nz0 HGpERxnAnf+6chQw6E/3ZXVKCBM+HdiVw1HpmnX91K5FiLecnPC8aD5VlFsrGg7LpTtKBLCwgwls ssSPRqJvI5wYZEsiGydp/nMcaJeruVOXpjwH7kUDy5HvANKOdtM3X2JJMxHbwPsqtbwo8nGAiE9r NaNLI9hO0Ljfud4rgCaHo0dWq9sD9zAKOvbmDbSGgQbgVXxgh2Oz+lVmfHfC6MMGcM57K1HJ+LnH QXhvEcKQif+HVq2LNpwkTRjJdY4a0ajeGSIFS5FXuA== =LAcb -----END PGP SIGNATURE----- --------------Lic1usorjc8S7FifC6L0nUmq-- From nobody Wed Apr 6 21:59:50 2022 X-Original-To: freebsd-performance@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 928531A82E25; Wed, 6 Apr 2022 21:59:54 +0000 (UTC) (envelope-from se@FreeBSD.org) Received: from smtp.freebsd.org (smtp.freebsd.org [96.47.72.83]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256 client-signature RSA-PSS (4096 bits) client-digest SHA256) (Client CN "smtp.freebsd.org", Issuer "R3" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4KYdj63FN7z3GN9; Wed, 6 Apr 2022 21:59:54 +0000 (UTC) (envelope-from se@FreeBSD.org) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=freebsd.org; s=dkim; t=1649282394; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=xS37a09f2j+D6oXvbNy3JpSp355g06Zi0IFCWDhLayM=; b=codISLKMW2nW2y3AwW5BRK1iehmBlyuvYdeaV45nRx3GMHJmC4sceWUpcsJ+Zns+6Afwy5 s5wx3XGj9OIFWP1TdhEMbEBQ94UGRi3HBGA4B6QbqTZOJ210JlHKNxvG2a35h9FuNrCA10 xsGKmNE0XyRpflCnBt6oOtkus3oJTFYPQobA+Z35sogdsmyZboLBz8j8XWMzv9WdD2y4nq kL+hxgzidxJ+TXqxE4aVD492T8K6Cax9tUW7vMYIXmrR33UAbCMiTZlTBfBu1I8IKdvjHn lzpqtbtUQzI32m4mymNrmM5QEhbxV/1mh/DjK0uOLy6XmAaeAVfGZMPE5xdK6A== Received: from [IPV6:2003:cd:5f22:6f00:953e:7ee1:500e:87a1] (p200300cd5f226f00953e7ee1500e87a1.dip0.t-ipconnect.de [IPv6:2003:cd:5f22:6f00:953e:7ee1:500e:87a1]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client did not present a certificate) (Authenticated sender: se/mail) by smtp.freebsd.org (Postfix) with ESMTPSA id 9F8A35E79; Wed, 6 Apr 2022 21:59:52 +0000 (UTC) (envelope-from se@FreeBSD.org) Message-ID: <0702dc56-28ba-7e99-d599-1036634d79e3@FreeBSD.org> Date: Wed, 6 Apr 2022 23:59:50 +0200 List-Id: Performance/tuning List-Archive: https://lists.freebsd.org/archives/freebsd-performance List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-performance@freebsd.org X-BeenThere: freebsd-performance@freebsd.org MIME-Version: 1.0 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:91.0) Gecko/20100101 Thunderbird/91.7.0 Subject: Re: {* 05.00 *}Re: Desperate with 870 QVO and ZFS Content-Language: en-US To: mike tancsa , Bob Friesenhahn , egoitz@ramattack.net Cc: freebsd-fs@FreeBSD.org, freebsd-hackers@FreeBSD.org, Freebsd performance References: <4e98275152e23141eae40dbe7ba5571f@ramattack.net> <665236B1-8F61-4B0E-BD9B-7B501B8BD617@ultra-secure.de> <0ef282aee34b441f1991334e2edbcaec@ramattack.net> 
<28e11d7ec0ac5dbea45f9f271fc28f06@ramattack.net> <7aa95cb4bf1fd38b3fce93bc26826042@ramattack.net> From: Stefan Esser In-Reply-To: Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="------------2AmGM1Q6OeFqsRTohjQ1sr1u" ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=freebsd.org; s=dkim; t=1649282394; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=xS37a09f2j+D6oXvbNy3JpSp355g06Zi0IFCWDhLayM=; b=LG0qHNBBEyPUrLC60qVGeR9xgTyLM5oZFNXhLwMLBBaRPMu69owI+TFY1SLZAQA+fKffIw Ap0T6ghK7ux71ilX8YdGf3JJJXQIxFHcUg57naKbEahuMjvJIyn/QJjlkdpfaLrJOnAvAY +Puxtjkjf5qsVCc/B8kruuj1F9auIA6BGnWMWK4R8eMD7bnENKrLwdtgen9Cn3yfT7Wf0Q ocAwgNhSQcQX2hVPSCfVMkslG+FoArobAVf+h4n605/Hos15KBw6k2f8bYnncNhrKR1clJ 4CAbni3TBA0HwoGgNe5jvSwAg7QjkFB39EXY1OHN92k11DrcTO/rGWM5mrXttw== ARC-Seal: i=1; s=dkim; d=freebsd.org; t=1649282394; a=rsa-sha256; cv=none; b=lc/d+Wp/isGygaYGCYvCBC5sW1+xGOXpnqBgsIIz8g3DE8mMdvkZXn42D4kKR8k5lBH+E5 qUoRfRm5H963+2IunYSSl12gbgpopEe409ATzxTktT7W3shhY5cfy+598s6S5tXahWI948 vi0Py1nJLg5OwaT/Dw1iRM9En3Vd7+fdfZH94d3mftlfszzC5NQkmfZrKjyBSt64MWfLC8 6xKUQpfKBhU6LGXOLZz9NCo/ebgFjjjE/RLQSnVtH6qqxfT00obfsnm+iPT+91ivxUkUgF s8GCA0Kcs90NNQG31zGDSxWEBPw896Yfcyx824EMvtFhYVYuoFzarg9G0U0QBg== ARC-Authentication-Results: i=1; mx1.freebsd.org; none X-ThisMailContainsUnwantedMimeParts: N This is an OpenPGP/MIME signed message (RFC 4880 and 3156) --------------2AmGM1Q6OeFqsRTohjQ1sr1u Content-Type: multipart/mixed; boundary="------------lhrXooAciDl05L70ht5xs4WV"; protected-headers="v1" From: Stefan Esser To: mike tancsa , Bob Friesenhahn , egoitz@ramattack.net Cc: freebsd-fs@FreeBSD.org, freebsd-hackers@FreeBSD.org, Freebsd performance Message-ID: <0702dc56-28ba-7e99-d599-1036634d79e3@FreeBSD.org> Subject: Re: {* 05.00 *}Re: Desperate with 870 QVO and ZFS References: <4e98275152e23141eae40dbe7ba5571f@ramattack.net> <665236B1-8F61-4B0E-BD9B-7B501B8BD617@ultra-secure.de> <0ef282aee34b441f1991334e2edbcaec@ramattack.net> <28e11d7ec0ac5dbea45f9f271fc28f06@ramattack.net> <7aa95cb4bf1fd38b3fce93bc26826042@ramattack.net> In-Reply-To: --------------lhrXooAciDl05L70ht5xs4WV Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Am 06.04.22 um 22:43 schrieb mike tancsa: > On 4/6/2022 4:18 PM, Bob Friesenhahn wrote: >> On Wed, 6 Apr 2022, egoitz@ramattack.net wrote: >>>> >>>> WE DON'T USE COMPRESSION AS IT'S NOT SET BY DEFAULT. SOME PEOPLE SAY= YOU >>>> SHOULD HAVE IT ENABLED.... BUT.... JUST FOR AVOID HAVING SOME DATA >>>> COMPRESSED SOME OTHER NOT (IN CASE YOU ENABLE AND LATER DISABLE) AND= >>>> FINALLY FOR AVOID ACCESSING TO INFORMATION WITH DIFFERENT CPU COSTS = OF >>>> HANDLING... WE HAVE NOT TOUCHED COMPRESSION.... >> >> There seems to be a problem with your caps-lock key. >> >> Since it seems that you said that you are using maildir for your mail = server, >> it is likely very useful if you do enable even rather mild compression= (e.g. >> lz4) since this will reduce the write work-load and even short files w= ill be >> stored more efficiently. >> > FYI, a couple of our big zfs=C2=A0 mailspools sees a 1.24x and 1.23x co= mpress ratio > with lz4.=C2=A0 We use Maildir format as well.=C2=A0 They are not RELEN= G_13 so not sure > how zstd would fair. I have got much better compression at same or less load by use of zstd-2 compared to lz4. 
Perhaps not typical, since this is a dovecot mdbox formatted mail pool holding mostly plain text messages without large attachments: $ df /var/mdbox Filesystem 1K-blocks Used Avail Capacity Mounted on system/var/mdbox 7234048944 9170888 7224878056 0% /var/mdbox $ zfs get compression,compressratio,used,logicalused system/var/mdbox NAME PROPERTY VALUE SOURCE system/var/mdbox compression zstd-2 inherited from system/va= r system/var/mdbox compressratio 2.29x - system/var/mdbox used 8.76G - system/var/mdbox logicalused 20.0G - Regards, STefan --------------lhrXooAciDl05L70ht5xs4WV-- --------------2AmGM1Q6OeFqsRTohjQ1sr1u Content-Type: application/pgp-signature; name="OpenPGP_signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="OpenPGP_signature" -----BEGIN PGP SIGNATURE----- wsB5BAABCAAjFiEEo3HqZZwL7MgrcVMTR+u171r99UQFAmJODVYFAwAAAAAACgkQR+u171r99UQH 3ggAjrm6XN5FfQ0OjamthevN4N2u/LyQAdJtTxB2Ta2BHBMDSrH969d/Ed7zHbpATNmtiGCDEVGS 5P93HzFDeE9dZRIp2yvIPOKhm1xuNlaK6anw31yMOhuTs+HkFZrW85/lqtkUyJgga82We9r7yTl7 JWsuLlNrBR6PmVLyi4GztB7wEF9JOd4oTCIye+001uP8OHtto4h5mX5APubPEaqZKk0mTCenfptZ ZVG+9q8xVj04jRp2Gia87XeODZ/Cvx1KsGzz5r+o2QBPxGKyO8Y7AaHWrRiJT4Rj6HW8T4F2ooNn x7rBKOwEUvnIQqDu7Pz8fQtIMVdv2Atc+SizWpARRA== =YgKS -----END PGP SIGNATURE----- --------------2AmGM1Q6OeFqsRTohjQ1sr1u-- From nobody Thu Apr 7 07:59:27 2022 X-Original-To: performance@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 05C4C1A97DC7; Thu, 7 Apr 2022 07:59:35 +0000 (UTC) (envelope-from se@FreeBSD.org) Received: from smtp.freebsd.org (smtp.freebsd.org [96.47.72.83]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256 client-signature RSA-PSS (4096 bits) client-digest SHA256) (Client CN "smtp.freebsd.org", Issuer "R3" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4KYv126R78z3KyW; Thu, 7 Apr 2022 07:59:34 +0000 (UTC) (envelope-from se@FreeBSD.org) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=freebsd.org; s=dkim; t=1649318375; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=m2HUVrbEq7o9ase+zDUCti07l48rJ+nU2EpdAQMDQ38=; b=NMnF19bzpHf2nHHVB1YxAjcApV7iALDxLIvW6idngan6XSg0GeYmvFdpcwm2/zvQ/+Ga4J 7ek0Evu1q19J1y6m+FT/veoZWm7xFnz6pm8EKkXfay5gVNWeCpyihq0zFetyPVKrEXXYR8 yHcEibj27FCtdbcevPD3Huzq+9UgVKon1u8uPyO8MCPaqO52ZV5LU72ERviW6LSs80ewxK eYIs9tydKw1vh2ZQY/fd8lGOv4pJXhIThInO1WiOFFXkK60CWjIuLkJ9Vqh9KQYrWVbkBx Jel3ssLOEUyAw+tVxjCASoU6//o3kZdel7elQKdv7Ei/vNs+k/mySkoWbINFMQ== Received: from [IPV6:2003:cd:5f22:6f00:34ed:cacb:3b28:daf] (p200300cd5f226f0034edcacb3b280daf.dip0.t-ipconnect.de [IPv6:2003:cd:5f22:6f00:34ed:cacb:3b28:daf]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client did not present a certificate) (Authenticated sender: se/mail) by smtp.freebsd.org (Postfix) with ESMTPSA id D6D7EA066; Thu, 7 Apr 2022 07:59:33 +0000 (UTC) (envelope-from se@FreeBSD.org) Message-ID: <49f43af5-e145-c793-959d-ab1596421d81@FreeBSD.org> Date: Thu, 7 Apr 2022 09:59:27 +0200 List-Id: Performance/tuning List-Archive: https://lists.freebsd.org/archives/freebsd-performance List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-performance@freebsd.org X-BeenThere: freebsd-performance@freebsd.org MIME-Version: 1.0 
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:91.0) Gecko/20100101 Thunderbird/91.7.0 Subject: Re: {* 05.00 *}Re: Desperate with 870 QVO and ZFS Content-Language: en-US To: egoitz@ramattack.net References: <4e98275152e23141eae40dbe7ba5571f@ramattack.net> <665236B1-8F61-4B0E-BD9B-7B501B8BD617@ultra-secure.de> <0ef282aee34b441f1991334e2edbcaec@ramattack.net> <28e11d7ec0ac5dbea45f9f271fc28f06@ramattack.net> <7aa95cb4bf1fd38b3fce93bc26826042@ramattack.net> From: Stefan Esser Cc: Jan Bramkamp , performance@freebsd.org, "freebsd-fs@freebsd.org" , FreeBSD Hackers In-Reply-To: Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="------------o4K5Sngmd3lLPnmlzLRzMtEK" ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=freebsd.org; s=dkim; t=1649318374; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=m2HUVrbEq7o9ase+zDUCti07l48rJ+nU2EpdAQMDQ38=; b=IWML0cH5d0UJDKTcEQqANYFGYOX7ABIcJAQ+YDeHZ4QAbaUZSpBJrCNxrZZsrvCAkwnMTY K74KzXtXJVTxHPp78gm2hLlPomWFSHeFMV3VJPRXJ+PFFuX4SA+nhQcOh28iO9JCM7hyJX PLLdH0kXGvvIq2o0ovvwQ0Liy7kQTp0BKz19FyYcMsGRzHAg7kzduV535q1Pw7JFIHoR2T IYbeGPMhAXrLbNSZyxLz6AKgw95LS8srBqWqkSKFmlr22ZpoEjXdnYCUy2EnPz/7WON5RX RxN3OqalkAJKygy71QOovZXkBhNZUej5Mwz4dAaEzOaAOJDe5HgTWrD17kqW+Q== ARC-Seal: i=1; s=dkim; d=freebsd.org; t=1649318374; a=rsa-sha256; cv=none; b=rApmUo+lAfX/P6o4SLQ68AGoD/Sx5BGrWCkFyCWD0/bXXVIBQWGs+Pi6shFlSIB6ZB5gIj ElK74KW/VSc62QR/K6s27Ji9I9iZy36zaE+AviLY5VURhHI0tyjl2jAF/AjsHaQQj0zqoh qNk++uhsi7umP6jnoQXI0DIXi8k98Jx48smYnExyveQfodOG60etpx2AgF/fT50iwOCtVQ bYbSc2Q5JFOBf4CRadmSUa8EMDNTVBnWlrJK6FGjN6mzMkisWg+s+N+S7z0ZK+jxMyXyAH NDvOmGPsf+5g3aWQz7Y5dNLIGe8PsGQ1Yx4n6FftSPBRddbI0b5akns9ejNw3w== ARC-Authentication-Results: i=1; mx1.freebsd.org; none X-ThisMailContainsUnwantedMimeParts: N This is an OpenPGP/MIME signed message (RFC 4880 and 3156) --------------o4K5Sngmd3lLPnmlzLRzMtEK Content-Type: multipart/mixed; boundary="------------ce4Isv9Wdp6yhBkTY0YsttW0"; protected-headers="v1" From: Stefan Esser To: egoitz@ramattack.net Cc: Jan Bramkamp , performance@freebsd.org, "freebsd-fs@freebsd.org" , FreeBSD Hackers Message-ID: <49f43af5-e145-c793-959d-ab1596421d81@FreeBSD.org> Subject: Re: {* 05.00 *}Re: Desperate with 870 QVO and ZFS References: <4e98275152e23141eae40dbe7ba5571f@ramattack.net> <665236B1-8F61-4B0E-BD9B-7B501B8BD617@ultra-secure.de> <0ef282aee34b441f1991334e2edbcaec@ramattack.net> <28e11d7ec0ac5dbea45f9f271fc28f06@ramattack.net> <7aa95cb4bf1fd38b3fce93bc26826042@ramattack.net> In-Reply-To: --------------ce4Isv9Wdp6yhBkTY0YsttW0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Am 06.04.22 um 23:19 schrieb Jan Bramkamp: > On 06.04.22 22:43, mike tancsa wrote: >> On 4/6/2022 4:18 PM, Bob Friesenhahn wrote: >>> On Wed, 6 Apr 2022, egoitz@ramattack.net wrote: >>>>> >>>>> WE DON'T USE COMPRESSION AS IT'S NOT SET BY DEFAULT. SOME PEOPLE SA= Y YOU >>>>> SHOULD HAVE IT ENABLED.... BUT.... JUST FOR AVOID HAVING SOME DATA >>>>> COMPRESSED SOME OTHER NOT (IN CASE YOU ENABLE AND LATER DISABLE) AN= D >>>>> FINALLY FOR AVOID ACCESSING TO INFORMATION WITH DIFFERENT CPU COSTS= OF >>>>> HANDLING... WE HAVE NOT TOUCHED COMPRESSION.... >>> >>> There seems to be a problem with your caps-lock key. 
>>>
>>> Since it seems that you said that you are using maildir for your mail
>>> server, it is likely very useful if you do enable even rather mild
>>> compression (e.g. lz4) since this will reduce the write work-load and even
>>> short files will be stored more efficiently.
>>>
>> FYI, a couple of our big zfs mailspools see a 1.24x and 1.23x compress
>> ratio with lz4. We use Maildir format as well. They are not RELENG_13 so
>> not sure how zstd would fare.
> I've found that Dovecot's mdbox format compresses a lot better than Maildir (or
> sdbox), because it stores multiple messages per file resulting in files large
> enough to contain enough exploitable redundancy to compress down to the next
> smaller blocksize. In a corporate or education environment where users tend to
> send the same medium to large attachments multiple times to multiple recipients
> on the same server Dovecot's single instance storage is a game changer. It
> reduced my IMAP storage requirements by a *factor* of 4.7 which allowed me to
> get rid of spinning disks for the mail servers instead of playing losing games
> with hybrid storage. Dovecot also supports zlib compression in the application
> instead of punting it to the file system. I don't know if Cyrus IMAP offers
> similar features, but if it does I would recommend evaluating them instead of
> compressing or deduplicating at the file system level.

I have not compared Dovecot's zlib compression with zstd-2 on the file system,
but since I use the latter on all my ZFS file systems (except those that
exclusively hold compressed files and media), I'm using it for Dovecot mdbox
files, too. I get a compression ratio of 2.29 with ZFS zstd-2, maybe I should
copy the files over into a zlib compressed mdbox for comparison ...

One large advantage of the mdbox format in the context of the mail server
set-up at the start of this thread is that deletions are only registered in an
index file (while mbox needs a rewrite of potentially large parts of the mail
folder and maildir immediately deletes files (TRIM) and updates inodes and
directory entries, causing multiple writes per deleted message).

With mdbox you can delay all "expensive" file system operations to the point
of least load each day, for example. Such a compression run is also well
suited for SSDs, since it does not perform random updates that punch holes in
a large number of erase blocks (which then will need to be garbage collected,
causing write amplification to put further load and stress on the SSD).
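For completeness: with mdbox that deferred cleanup is a single command and can simply be scheduled into the quiet hours. The time and path below are only an example (the FreeBSD port installs doveadm under /usr/local/bin):

# /etc/crontab: purge messages marked as deleted from the mdbox files at 04:00
0  4  *  *  *  root  /usr/local/bin/doveadm purge -A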
--------------ce4Isv9Wdp6yhBkTY0YsttW0-- --------------o4K5Sngmd3lLPnmlzLRzMtEK Content-Type: application/pgp-signature; name="OpenPGP_signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="OpenPGP_signature" -----BEGIN PGP SIGNATURE----- wsB5BAABCAAjFiEEo3HqZZwL7MgrcVMTR+u171r99UQFAmJOmd8FAwAAAAAACgkQR+u171r99UQz fQgAj/0scy7zAbl1SoRPExnKQSTSk320RX81cVGflFCk2hDHRKeF9bScO22aZil0nYKaCKPR1Mps kyujxrmFpTgWjrjhNe7noHe5sz3LlGplXB7YNMKr0eujF1VC9YrlSvQLGTDFJeJyIRcI7EjSAgoy 8aLZjMG8rI7XCiMo1y+bpJqyWsElxYFoiomi2h2fkZ5MFZtWfyqPNaCFd2e4YHW6x6WYS9/H9ZDc k3E6GoFUaoRIuHC5dw2HbeUxBr72TYRLZgSH5pMBlk0cN60QNHkOBiBahbyzPVA/U1IJqCSVTWJP 7ZUNURO6nTn71io0o148Jho31IitSdxA4cuaJOOIVg== =DqKu -----END PGP SIGNATURE----- --------------o4K5Sngmd3lLPnmlzLRzMtEK-- From nobody Thu Apr 7 08:49:15 2022 X-Original-To: freebsd-performance@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 388251AA426B; Thu, 7 Apr 2022 08:49:21 +0000 (UTC) (envelope-from egoitz@ramattack.net) Received: from cu01208b.smtpx.saremail.com (cu01208b.smtpx.saremail.com [195.16.151.183]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 4KYw6R6nMZz3k6B; Thu, 7 Apr 2022 08:49:19 +0000 (UTC) (envelope-from egoitz@ramattack.net) Received: from www.saremail.com (unknown [194.30.0.183]) by sieve-smtp-backend01.sarenet.es (Postfix) with ESMTPA id DADEA60C641; Thu, 7 Apr 2022 10:49:15 +0200 (CEST) List-Id: Performance/tuning List-Archive: https://lists.freebsd.org/archives/freebsd-performance List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-performance@freebsd.org X-BeenThere: freebsd-performance@freebsd.org MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="=_85ed0b4a49488ec1a940cb1f7fed0376" Date: Thu, 07 Apr 2022 10:49:15 +0200 From: egoitz@ramattack.net To: Eugene Grosbein Cc: freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org, Freebsd performance Subject: Re: {* 05.00 *}Re: Re: Desperate with 870 QVO and ZFS In-Reply-To: References: <4e98275152e23141eae40dbe7ba5571f@ramattack.net> <665236B1-8F61-4B0E-BD9B-7B501B8BD617@ultra-secure.de> <0ef282aee34b441f1991334e2edbcaec@ramattack.net> <28e11d7ec0ac5dbea45f9f271fc28f06@ramattack.net> <7aa95cb4bf1fd38b3fce93bc26826042@ramattack.net> Message-ID: <55000f00fb64510e8ef6b8ad858d8855@ramattack.net> X-Sender: egoitz@ramattack.net User-Agent: Saremail webmail X-Rspamd-Queue-Id: 4KYw6R6nMZz3k6B X-Spamd-Bar: --- Authentication-Results: mx1.freebsd.org; dkim=none; dmarc=pass (policy=reject) header.from=ramattack.net; spf=pass (mx1.freebsd.org: domain of egoitz@ramattack.net designates 195.16.151.183 as permitted sender) smtp.mailfrom=egoitz@ramattack.net X-Spamd-Result: default: False [-3.79 / 15.00]; RCVD_TLS_LAST(0.00)[]; RCVD_VIA_SMTP_AUTH(0.00)[]; XM_UA_NO_VERSION(0.01)[]; RCPT_COUNT_THREE(0.00)[4]; TO_DN_SOME(0.00)[]; R_SPF_ALLOW(-0.20)[+ip4:195.16.151.0/24]; NEURAL_HAM_LONG(-1.00)[-1.000]; MIME_GOOD(-0.10)[multipart/alternative,text/plain]; ARC_NA(0.00)[]; TO_MATCH_ENVRCPT_SOME(0.00)[]; NEURAL_HAM_SHORT(-1.00)[-1.000]; DMARC_POLICY_ALLOW(-0.50)[ramattack.net,reject]; FROM_NO_DN(0.00)[]; NEURAL_HAM_MEDIUM(-1.00)[-1.000]; MLMMJ_DEST(0.00)[freebsd-fs,freebsd-hackers,freebsd-performance]; FROM_EQ_ENVFROM(0.00)[]; R_DKIM_NA(0.00)[]; 
MIME_TRACE(0.00)[0:+,1:+,2:~]; ASN(0.00)[asn:3262, ipnet:195.16.128.0/19, country:ES]; RCVD_COUNT_TWO(0.00)[2]; MID_RHS_MATCH_FROM(0.00)[] X-ThisMailContainsUnwantedMimeParts: N --=_85ed0b4a49488ec1a940cb1f7fed0376 Content-Transfer-Encoding: 8bit Content-Type: text/plain; charset=UTF-8 Good morning Eugene!! Thank you so much for your help mate :) :) really :) :) Ok I take good notes of all you have replied me below :) :) Very very thankful for your help really :) Cheers, El 2022-04-06 20:10, Eugene Grosbein escribió: > ATENCION > ATENCION > ATENCION!!! Este correo se ha enviado desde fuera de la organizacion. No pinche en los enlaces ni abra los adjuntos a no ser que reconozca el remitente y sepa que el contenido es seguro. > > 06.04.2022 23:51, egoitz@ramattack.net wrote: > >> About your recommendations... Eugene, if some of them wouldn't be working as expected, >> could we revert some or all of them > > Yes, it all can be reverted. > Just write down original sysctl values if you are going to change it. > >> 1) Make sure the pool has enough free space because ZFS can became crawling slow otherwise. >> >> *This is just an example... but you can see all similarly....* >> >> *zpool list* >> *NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT* >> *zroot 448G 2.27G 446G - - 1% 0% 1.00x ONLINE -* >> *mail_dataset 58.2T 19.4T 38.8T - - 32% 33% 1.00x ONLINE -* > > It's all right. > >> 2) Increase recordsize upto 1MB for file systems located in the pool >> so ZFS is allowed to use bigger request sizes for read/write operations >> >> *We have the default... so 128K...* > > It will not hurt increasing it upto 1MB. > >> 5) If you have good power supply and stable (non-crashing) OS, try increasing >> sysctl vfs.zfs.txg.timeout from defaule 5sec, but do not be extreme (f.e. upto 10sec). >> Maybe it will increase amount of long writes and decrease amount of short writes, that is good. >> >> *Well I have sync in disabled in the datasets... do you still think it's good to change it? > > Yes, try it. Disabling sync makes sense if you have lots of fsync() operations > but other small writes are not affected unless you raise vfs.zfs.txg.timeout > >> *What about the vfs.zfs.dirty_data_max and the vfs.zfs.dirty_data_max_max, would you increase them from 4GB it's set now?.* > > Never tried that and cannot tell. --=_85ed0b4a49488ec1a940cb1f7fed0376 Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset=UTF-8
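For reference, the suggestions discussed above are all one-liners and easy to revert; as Eugene says, note the old values down first (the child dataset name is illustrative):

zpool list mail_dataset                        # 1) keep an eye on free space
zfs set recordsize=1M mail_dataset/mailboxes   # 2) only affects newly written files
sysctl vfs.zfs.txg.timeout=10                  # 5) default is 5 seconds; revert with =5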

--=_85ed0b4a49488ec1a940cb1f7fed0376-- From nobody Thu Apr 7 08:56:26 2022 X-Original-To: freebsd-performance@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id C741A1AA6C09; Thu, 7 Apr 2022 08:56:30 +0000 (UTC) (envelope-from egoitz@ramattack.net) Received: from cu1208c.smtpx.saremail.com (cu1208c.smtpx.saremail.com [195.16.148.183]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 4KYwGj2cvcz3mnC; Thu, 7 Apr 2022 08:56:28 +0000 (UTC) (envelope-from egoitz@ramattack.net) Received: from www.saremail.com (unknown [194.30.0.183]) by sieve-smtp-backend02.sarenet.es (Postfix) with ESMTPA id 6808A60C149; Thu, 7 Apr 2022 10:56:26 +0200 (CEST) List-Id: Performance/tuning List-Archive: https://lists.freebsd.org/archives/freebsd-performance List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-performance@freebsd.org X-BeenThere: freebsd-performance@freebsd.org MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="=_d38b5728c4674c13e58315ae3b5e7d2c" Date: Thu, 07 Apr 2022 10:56:26 +0200 From: egoitz@ramattack.net To: Bob Friesenhahn Cc: freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org, Freebsd performance , owner-freebsd-fs@freebsd.org Subject: Re: {* 05.00 *}Re: Re: Desperate with 870 QVO and ZFS In-Reply-To: References: <4e98275152e23141eae40dbe7ba5571f@ramattack.net> <665236B1-8F61-4B0E-BD9B-7B501B8BD617@ultra-secure.de> <0ef282aee34b441f1991334e2edbcaec@ramattack.net> <28e11d7ec0ac5dbea45f9f271fc28f06@ramattack.net> <7aa95cb4bf1fd38b3fce93bc26826042@ramattack.net> Message-ID: X-Sender: egoitz@ramattack.net User-Agent: Saremail webmail X-Rspamd-Queue-Id: 4KYwGj2cvcz3mnC X-Spamd-Bar: --- Authentication-Results: mx1.freebsd.org; dkim=none; dmarc=pass (policy=reject) header.from=ramattack.net; spf=pass (mx1.freebsd.org: domain of egoitz@ramattack.net designates 195.16.148.183 as permitted sender) smtp.mailfrom=egoitz@ramattack.net X-Spamd-Result: default: False [-3.79 / 15.00]; RCVD_TLS_LAST(0.00)[]; RCVD_VIA_SMTP_AUTH(0.00)[]; XM_UA_NO_VERSION(0.01)[]; TO_DN_SOME(0.00)[]; R_SPF_ALLOW(-0.20)[+ip4:195.16.148.0/24]; NEURAL_HAM_LONG(-1.00)[-1.000]; MIME_GOOD(-0.10)[multipart/alternative,text/plain]; ARC_NA(0.00)[]; RCPT_COUNT_FIVE(0.00)[5]; TO_MATCH_ENVRCPT_SOME(0.00)[]; NEURAL_HAM_SHORT(-1.00)[-1.000]; DMARC_POLICY_ALLOW(-0.50)[ramattack.net,reject]; FROM_NO_DN(0.00)[]; NEURAL_HAM_MEDIUM(-1.00)[-1.000]; MLMMJ_DEST(0.00)[freebsd-fs,freebsd-hackers,freebsd-performance]; FROM_EQ_ENVFROM(0.00)[]; R_DKIM_NA(0.00)[]; MIME_TRACE(0.00)[0:+,1:+,2:~]; ASN(0.00)[asn:3262, ipnet:195.16.128.0/19, country:ES]; RCVD_COUNT_TWO(0.00)[2]; MID_RHS_MATCH_FROM(0.00)[] X-ThisMailContainsUnwantedMimeParts: N --=_d38b5728c4674c13e58315ae3b5e7d2c Content-Transfer-Encoding: 8bit Content-Type: text/plain; charset=UTF-8 Hi Bob! Thank you so much really for your comments :) :) Wow! I wouldn't have wanted to write in capital letters.... I would have sworn not to have done.... Apologies for that really ..... Note taking mate. We didn't changed almost nothing than the sync param, for avoid modifying the most we could the default config of ZFS. We thought it could perhaps be the most stable config and we have not disk space problems so... Apart of that, for avoid load coming from compression/decompression.... 
although we have lots of cpu too..... Thanks a lot Bob :) Cheers, El 2022-04-06 22:18, Bob Friesenhahn escribió: > ATENCION > ATENCION > ATENCION!!! Este correo se ha enviado desde fuera de la organizacion. No pinche en los enlaces ni abra los adjuntos a no ser que reconozca el remitente y sepa que el contenido es seguro. > > On Wed, 6 Apr 2022, egoitz@ramattack.net wrote: > WE DON'T USE COMPRESSION AS IT'S NOT SET BY DEFAULT. SOME PEOPLE SAY YOU SHOULD HAVE IT ENABLED.... BUT.... JUST FOR AVOID HAVING SOME DATA COMPRESSED SOME OTHER NOT (IN CASE YOU ENABLE AND LATER DISABLE) AND FINALLY FOR AVOID ACCESSING TO INFORMATION WITH DIFFERENT CPU COSTS OF HANDLING... WE HAVE NOT TOUCHED COMPRESSION.... There seems to be a problem with your caps-lock key. Since it seems that you said that you are using maildir for your mail server, it is likely very useful if you do enable even rather mild compression (e.g. lz4) since this will reduce the write work-load and even short files will be stored more efficiently. Bob --=_d38b5728c4674c13e58315ae3b5e7d2c Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset=UTF-8
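Should you want to try Bob's suggestion, lz4 is available on any supported FreeBSD release and is set (and reverted) with a single property; the pool name is the one from the earlier zpool list output:

zfs set compression=lz4 mail_dataset                       # children inherit the setting
zfs get -r -t filesystem compression mail_dataset | head   # verify the inheritance

Only new writes are compressed; existing mailboxes keep their current on-disk size until they are rewritten.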

--=_d38b5728c4674c13e58315ae3b5e7d2c-- From nobody Thu Apr 7 08:59:50 2022 X-Original-To: freebsd-performance@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 901611A80E64; Thu, 7 Apr 2022 08:59:53 +0000 (UTC) (envelope-from egoitz@ramattack.net) Received: from cu1208c.smtpx.saremail.com (cu1208c.smtpx.saremail.com [195.16.148.183]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 4KYwLc4b9Kz3pb3; Thu, 7 Apr 2022 08:59:52 +0000 (UTC) (envelope-from egoitz@ramattack.net) Received: from www.saremail.com (unknown [194.30.0.183]) by sieve-smtp-backend02.sarenet.es (Postfix) with ESMTPA id A23F660C6A9; Thu, 7 Apr 2022 10:59:50 +0200 (CEST) List-Id: Performance/tuning List-Archive: https://lists.freebsd.org/archives/freebsd-performance List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-performance@freebsd.org X-BeenThere: freebsd-performance@freebsd.org MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="=_c2807b764a57b09883b960b84ffbc8e7" Date: Thu, 07 Apr 2022 10:59:50 +0200 From: egoitz@ramattack.net To: mike tancsa Cc: Bob Friesenhahn , freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org, Freebsd performance Subject: Re: {* 05.00 *}Re: Re: Desperate with 870 QVO and ZFS In-Reply-To: References: <4e98275152e23141eae40dbe7ba5571f@ramattack.net> <665236B1-8F61-4B0E-BD9B-7B501B8BD617@ultra-secure.de> <0ef282aee34b441f1991334e2edbcaec@ramattack.net> <28e11d7ec0ac5dbea45f9f271fc28f06@ramattack.net> <7aa95cb4bf1fd38b3fce93bc26826042@ramattack.net> Message-ID: <609373d106c2244a8a2a3e2ca5e6eb73@ramattack.net> X-Sender: egoitz@ramattack.net User-Agent: Saremail webmail X-Rspamd-Queue-Id: 4KYwLc4b9Kz3pb3 X-Spamd-Bar: --- Authentication-Results: mx1.freebsd.org; dkim=none; dmarc=pass (policy=reject) header.from=ramattack.net; spf=pass (mx1.freebsd.org: domain of egoitz@ramattack.net designates 195.16.148.183 as permitted sender) smtp.mailfrom=egoitz@ramattack.net X-Spamd-Result: default: False [-3.79 / 15.00]; ARC_NA(0.00)[]; RCVD_VIA_SMTP_AUTH(0.00)[]; XM_UA_NO_VERSION(0.01)[]; RCVD_TLS_LAST(0.00)[]; TO_DN_SOME(0.00)[]; R_SPF_ALLOW(-0.20)[+ip4:195.16.148.0/24:c]; NEURAL_HAM_LONG(-1.00)[-1.000]; MIME_GOOD(-0.10)[multipart/alternative,text/plain]; NEURAL_HAM_MEDIUM(-1.00)[-1.000]; RCPT_COUNT_FIVE(0.00)[5]; TO_MATCH_ENVRCPT_SOME(0.00)[]; NEURAL_HAM_SHORT(-1.00)[-1.000]; DMARC_POLICY_ALLOW(-0.50)[ramattack.net,reject]; FROM_NO_DN(0.00)[]; MLMMJ_DEST(0.00)[freebsd-fs,freebsd-hackers,freebsd-performance]; FROM_EQ_ENVFROM(0.00)[]; R_DKIM_NA(0.00)[]; MIME_TRACE(0.00)[0:+,1:+,2:~]; ASN(0.00)[asn:3262, ipnet:195.16.128.0/19, country:ES]; RCVD_COUNT_TWO(0.00)[2]; MID_RHS_MATCH_FROM(0.00)[] X-ThisMailContainsUnwantedMimeParts: N --=_c2807b764a57b09883b960b84ffbc8e7 Content-Transfer-Encoding: 8bit Content-Type: text/plain; charset=UTF-8 Hi Mike! Thanks a lot for your comment. I see. As said before, we didn't really enable compression because we just keep the config as FreeBSD leaves by default. Apart from that, having tons of disk space and well... for avoiding the load of compress/decompress... The main reason was it was not enabled by default really and not to have seen a real reason for it.... was not more than that.... 
I appreciate your comments really :)

Cheers,

On 2022-04-06 22:43, mike tancsa wrote:

> On 4/6/2022 4:18 PM, Bob Friesenhahn wrote:
>> On Wed, 6 Apr 2022, egoitz@ramattack.net wrote:
>>> [...]
>>
>> There seems to be a problem with your caps-lock key.
>>
>> Since it seems that you said that you are using maildir for your mail
>> server, it is likely very useful if you do enable even rather mild
>> compression (e.g. lz4) since this will reduce the write work-load and
>> even short files will be stored more efficiently.
>
> FYI, a couple of our big zfs mailspools sees a 1.24x and 1.23x compress
> ratio with lz4. We use Maildir format as well. They are not RELENG_13 so
> not sure how zstd would fare.
>
>     ---Mike
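Mike's numbers suggest a low-risk way to answer the zstd question for this particular mail mix: copy a sample mailbox into two scratch datasets, one per algorithm, and compare. A rough sketch under assumed pool and path names (zstd requires OpenZFS 2.0, i.e. FreeBSD 13 or newer):

  # Throw-away datasets, one per candidate algorithm (names are examples).
  $ zfs create -o compression=lz4 mail/test-lz4
  $ zfs create -o compression=zstd-2 mail/test-zstd2
  # Copy the same sample mailbox into both, then compare the achieved ratios.
  $ cp -Rp /mail/sample-user /mail/test-lz4/
  $ cp -Rp /mail/sample-user /mail/test-zstd2/
  $ zfs get compressratio,used,logicalused mail/test-lz4 mail/test-zstd2
  # Remove the scratch datasets when done.
  $ zfs destroy mail/test-lz4
  $ zfs destroy mail/test-zstd2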


--=_c2807b764a57b09883b960b84ffbc8e7-- From nobody Thu Apr 7 10:05:42 2022 X-Original-To: freebsd-performance@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 6937A1A9104B; Thu, 7 Apr 2022 10:05:46 +0000 (UTC) (envelope-from se@FreeBSD.org) Received: from smtp.freebsd.org (smtp.freebsd.org [96.47.72.83]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256 client-signature RSA-PSS (4096 bits) client-digest SHA256) (Client CN "smtp.freebsd.org", Issuer "R3" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4KYxpf1mgPz4VgL; Thu, 7 Apr 2022 10:05:46 +0000 (UTC) (envelope-from se@FreeBSD.org) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=freebsd.org; s=dkim; t=1649325946; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=5VyYb2yVup9mr10ClzHwP1P7qCT1D6ac9qjMWMQpYPk=; b=J9AwR7QGgnmSa3A7RyFhNZlz+Dxw2twvndSx8IPIRNsDTVooRRHtoZ7h1+gMQOh0VEb9r1 lV84OzxPP0ykzqwJsFRlsPDVTItjJQK0q1yLpWMK6p1bKkkEgFNLQ81K4P3LUFIiMClYvp +/fme73HUjV1rrWZqai8D1vqbGTfryPa3J+UdAaiwDx2440i0uBtb1g8RxtGbOAoiU42ro q7ZvA62+gwZ2bF8X3LMFFE5pLaKmQi4cDzfATAkWsjMEO74+N7QG0rU1Oj/s8S0YRkWj/M 7G5MkwcilTG0mFsIJx1CbCU/lCrqqNotRBGp2lagrbX7F6pW1AlyW0DViyMAKQ== Received: from [IPV6:2003:cd:5f22:6f00:34ed:cacb:3b28:daf] (p200300cd5f226f0034edcacb3b280daf.dip0.t-ipconnect.de [IPv6:2003:cd:5f22:6f00:34ed:cacb:3b28:daf]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client did not present a certificate) (Authenticated sender: se/mail) by smtp.freebsd.org (Postfix) with ESMTPSA id 57989B868; Thu, 7 Apr 2022 10:05:45 +0000 (UTC) (envelope-from se@FreeBSD.org) Message-ID: <4ef109e8-bd7b-1398-2bc9-191e261d5c06@FreeBSD.org> Date: Thu, 7 Apr 2022 12:05:42 +0200 List-Id: Performance/tuning List-Archive: https://lists.freebsd.org/archives/freebsd-performance List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-performance@freebsd.org X-BeenThere: freebsd-performance@freebsd.org MIME-Version: 1.0 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:91.0) Gecko/20100101 Thunderbird/91.7.0 Subject: Re: {* 05.00 *}Re: Desperate with 870 QVO and ZFS Content-Language: en-US From: Stefan Esser To: egoitz@ramattack.net Cc: freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org, freebsd-performance@freebsd.org, Rainer Duffner References: <4e98275152e23141eae40dbe7ba5571f@ramattack.net> <665236B1-8F61-4B0E-BD9B-7B501B8BD617@ultra-secure.de> <0ef282aee34b441f1991334e2edbcaec@ramattack.net> In-Reply-To: Content-Type: multipart/signed; micalg=pgp-sha256; protocol="application/pgp-signature"; boundary="------------R8ub0be6TPBsoV0UvSwjPqsv" ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=freebsd.org; s=dkim; t=1649325946; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=5VyYb2yVup9mr10ClzHwP1P7qCT1D6ac9qjMWMQpYPk=; b=RWVkDqf8CQrXjJevQVFL0cShsK6iSZaO1jgE2Bi17Mnlp9ugcTH7ulNWb7xwGAtQUp37P7 w3iX6N9imCTlq1qr1VW74CVMGcRRhHJMty/o6myH3BBue0yalPYyhPeo8tZPooYkkYXV92 TMnvXkls2R8OOcsMCvywiIn1ZKIihEhBaZn1db4MLOUe05VMrCvPXgyBsgE6hrtxLVcJzL jK0v5ylRXmKuP9f2iFcfPVYjbyReNcRsCRifFXziero9bJLodjprmIBgBGhRe0KrQqckia 
cJgbD/ILyDZGJFRbK9NP8Je+Yy5abeS1DhYGSR8qPgD361vy44hLJEqC8qSOWQ== ARC-Authentication-Results: i=1; mx1.freebsd.org; none X-ThisMailContainsUnwantedMimeParts: N This is an OpenPGP/MIME signed message (RFC 4880 and 3156) --------------R8ub0be6TPBsoV0UvSwjPqsv Content-Type: multipart/mixed; boundary="------------MH3UBEk8xxTm08glzsqFwYDD"; protected-headers="v1" From: Stefan Esser To: egoitz@ramattack.net Cc: freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org, freebsd-performance@freebsd.org, Rainer Duffner Message-ID: <4ef109e8-bd7b-1398-2bc9-191e261d5c06@FreeBSD.org> Subject: Re: {* 05.00 *}Re: Desperate with 870 QVO and ZFS References: <4e98275152e23141eae40dbe7ba5571f@ramattack.net> <665236B1-8F61-4B0E-BD9B-7B501B8BD617@ultra-secure.de> <0ef282aee34b441f1991334e2edbcaec@ramattack.net> In-Reply-To: --------------MH3UBEk8xxTm08glzsqFwYDD Content-Type: text/plain; charset=UTF-8

On 06.04.22 at 23:49, Stefan Esser wrote:
> On 06.04.22 at 18:34, egoitz@ramattack.net wrote:
>
>>> The 870 QVO is specified for 370 full capacity writes, i.e. 370 TB for
>>> the 1 TB model. That's still a few hundred GB a day - but only if the
>>> write amplification stays in a reasonable range ...
>>>
>>> *Well yes... 2880TB in our case....not bad.. isn't it?*
>
> I assume that 2880 TB is your total storage capacity? That's not too bad,
> in fact. ;-)

I just noticed that this is not the extreme total size of a ZFS pool (should have noticed this while answering late at night ...)

And no, a specified life-time of 2880 TB written is not much, it is at the absolute lower end of currently available SSDs at 360 TB per 1 TB of capacity. This is equivalent to 360 total capacity writes, but given the high amount of write amplification that can be assumed to occur in your use case, I'd heavily over-provision a system with such SSDs ... (or rather: strictly avoid them in a non-consumer setting).
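Stefan's endurance figures translate into a quick back-of-the-envelope check. The daily write rate used below is purely an assumed number for illustration, and the device names and SMART attribute are examples (smartctl comes from the smartmontools package):

  # Rated endurance from the thread: ~360 full-capacity writes, i.e. ~2880 TB
  # written for an 8 TB drive.  Assuming, for illustration, 500 GB of writes
  # actually landing on each disk per day (redundancy and write amplification
  # included): 2880 TB / 0.5 TB per day ~= 5760 days, roughly 15 years.
  # A 5-10x higher effective write rate shrinks that to a few years.
  #
  # Measuring the real per-disk write rate rather than guessing:
  $ iostat -x -w 60 ada0 ada1                # average KiB/s written per 60 s interval
  $ smartctl -A /dev/ada0 | grep -i written  # e.g. Total_LBAs_Written on many Samsung models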
--------------MH3UBEk8xxTm08glzsqFwYDD-- --------------R8ub0be6TPBsoV0UvSwjPqsv Content-Type: application/pgp-signature; name="OpenPGP_signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="OpenPGP_signature" -----BEGIN PGP SIGNATURE----- wsB5BAABCAAjFiEEo3HqZZwL7MgrcVMTR+u171r99UQFAmJOt3YFAwAAAAAACgkQR+u171r99USG 4AgAqvLdjcqRhYzJHDNPIR+8TN0JKwMia2AI7dwFi/edSRFtjP50d/sbgxhP7xkf6a13LuLypubc UK2ldNCHlczP/9m5+uG2MTb9E24Pc1bpX4o459oFzx7MGRyRksMmMKnNyidiVIx4W/ihfZa0b/5K ICbgTmE79RIc3tX8cn77Kp8W8asjABCE2vgqTLtyJpfuhBSg60flkjMAOoBwsyXDNKvEDWWuR3zU U4kxS00DZ2Cb8EHMmi5yiyH+GqoJWRH55BYl0bw3ppDubbQuEZWJojcMZ2eWFJYcC29LkYvCUDrc YgesfZGPQ9eTk1cbKq01ZOCk4zOsgt2PcmFla759oQ== =GJ82 -----END PGP SIGNATURE----- --------------R8ub0be6TPBsoV0UvSwjPqsv-- From nobody Thu Apr 7 11:25:47 2022 X-Original-To: freebsd-performance@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 1DE191A86A2A; Thu, 7 Apr 2022 11:25:52 +0000 (UTC) (envelope-from mike@sentex.net) Received: from smarthost1.sentex.ca (smarthost1.sentex.ca [IPv6:2607:f3e0:0:1::12]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "smarthost1.sentex.ca", Issuer "R3" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4KYzb31Xn5z4mwN; Thu, 7 Apr 2022 11:25:51 +0000 (UTC) (envelope-from mike@sentex.net) Received: from pyroxene2a.sentex.ca (pyroxene19.sentex.ca [199.212.134.19]) by smarthost1.sentex.ca (8.16.1/8.16.1) with ESMTPS id 237BPnGI010096 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=FAIL); Thu, 7 Apr 2022 07:25:49 -0400 (EDT) (envelope-from mike@sentex.net) Received: from [IPV6:2607:f3e0:0:4:434:73cd:9d42:28ad] ([IPv6:2607:f3e0:0:4:434:73cd:9d42:28ad]) by pyroxene2a.sentex.ca (8.16.1/8.15.2) with ESMTPS id 237BPlhB073256 (version=TLSv1.3 cipher=TLS_AES_128_GCM_SHA256 bits=128 verify=NO); Thu, 7 Apr 2022 07:25:48 -0400 (EDT) (envelope-from mike@sentex.net) Message-ID: Date: Thu, 7 Apr 2022 07:25:47 -0400 List-Id: Performance/tuning List-Archive: https://lists.freebsd.org/archives/freebsd-performance List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-performance@freebsd.org X-BeenThere: freebsd-performance@freebsd.org MIME-Version: 1.0 User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Thunderbird/91.7.0 Subject: Re: {* 05.00 *}Re: Re: Desperate with 870 QVO and ZFS Content-Language: en-US To: egoitz@ramattack.net Cc: freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org, Freebsd performance References: <4e98275152e23141eae40dbe7ba5571f@ramattack.net> <665236B1-8F61-4B0E-BD9B-7B501B8BD617@ultra-secure.de> <0ef282aee34b441f1991334e2edbcaec@ramattack.net> <28e11d7ec0ac5dbea45f9f271fc28f06@ramattack.net> <7aa95cb4bf1fd38b3fce93bc26826042@ramattack.net> <609373d106c2244a8a2a3e2ca5e6eb73@ramattack.net> From: mike tancsa In-Reply-To: <609373d106c2244a8a2a3e2ca5e6eb73@ramattack.net> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Scanned-By: MIMEDefang 2.84 X-Rspamd-Queue-Id: 4KYzb31Xn5z4mwN X-Spamd-Bar: -- Authentication-Results: mx1.freebsd.org; dkim=none; dmarc=none; spf=pass (mx1.freebsd.org: domain of mike@sentex.net designates 2607:f3e0:0:1::12 as permitted sender) smtp.mailfrom=mike@sentex.net X-Spamd-Result: default: False [-2.75 / 15.00]; ARC_NA(0.00)[]; NEURAL_HAM_MEDIUM(-0.35)[-0.349]; FREEFALL_USER(0.00)[mike]; RCPT_COUNT_THREE(0.00)[4]; 
TO_DN_SOME(0.00)[]; R_SPF_ALLOW(-0.20)[+ip6:2607:f3e0::/32]; FROM_HAS_DN(0.00)[]; MIME_GOOD(-0.10)[text/plain]; DMARC_NA(0.00)[sentex.net]; NEURAL_HAM_LONG(-1.00)[-1.000]; RCVD_COUNT_THREE(0.00)[3]; TO_MATCH_ENVRCPT_SOME(0.00)[]; NEURAL_HAM_SHORT(-1.00)[-1.000]; MLMMJ_DEST(0.00)[freebsd-fs,freebsd-hackers,freebsd-performance]; FROM_EQ_ENVFROM(0.00)[]; R_DKIM_NA(0.00)[]; MIME_TRACE(0.00)[0:+]; ASN(0.00)[asn:11647, ipnet:2607:f3e0::/32, country:CA]; RCVD_TLS_ALL(0.00)[]; MID_RHS_MATCH_FROM(0.00)[]; RCVD_IN_DNSWL_LOW(-0.10)[199.212.134.19:received] X-ThisMailContainsUnwantedMimeParts: N On 4/7/2022 4:59 AM, egoitz@ramattack.net wrote: > > Hi Mike! > > Thanks a lot for your comment. I see. As said before, we didn't really > enable compression because we just keep the config as FreeBSD leaves > by default. Apart from that, having tons of disk space and well... for > avoiding the load of compress/decompress... The main reason was it was > not enabled by default really and not to have seen a real reason for > it.... was not more than that....I appreciate your comments really :) Hi,     With respect to compression, I think there is a sweet spot somewhere, where compression makes things faster if your disk IO is the limiting factor and you have spare CPU capacity.  I have a separate 13.x zfs server with ztsd enabled and I get compression rations of 15:1 as it stores a lot of giant JSON txt files. Think of the extreme case where you do something like dd if=/dev/zero of=/tank/junk.bin bs=1m count=10000 as this is a 20G file that takes just a few hundred bytes of write IO on a compressed system. Obviously, as the compress ratio reduces in the real world the benefits become less.  Where that diminishing return is, not sure.  But something to keep in mind     ---Mike From nobody Thu Apr 7 12:30:47 2022 X-Original-To: freebsd-performance@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 679491A9AAC5; Thu, 7 Apr 2022 12:31:00 +0000 (UTC) (envelope-from egoitz@ramattack.net) Received: from cu01208b.smtpx.saremail.com (cu01208b.smtpx.saremail.com [195.16.151.183]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 4KZ12B28vFz3M2k; Thu, 7 Apr 2022 12:30:57 +0000 (UTC) (envelope-from egoitz@ramattack.net) Received: from www.saremail.com (unknown [194.30.0.183]) by sieve-smtp-backend01.sarenet.es (Postfix) with ESMTPA id 35B5A60C4C4; Thu, 7 Apr 2022 14:30:48 +0200 (CEST) List-Id: Performance/tuning List-Archive: https://lists.freebsd.org/archives/freebsd-performance List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-performance@freebsd.org X-BeenThere: freebsd-performance@freebsd.org MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="=_ccf36dab3b44229808a0a46435736314" Date: Thu, 07 Apr 2022 14:30:47 +0200 From: egoitz@ramattack.net To: Stefan Esser Cc: freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org, freebsd-performance@freebsd.org, Rainer Duffner Subject: Re: Re: Re: Desperate with 870 QVO and ZFS In-Reply-To: References: <4e98275152e23141eae40dbe7ba5571f@ramattack.net> <665236B1-8F61-4B0E-BD9B-7B501B8BD617@ultra-secure.de> <0ef282aee34b441f1991334e2edbcaec@ramattack.net> Message-ID: X-Sender: egoitz@ramattack.net User-Agent: Saremail webmail X-Rspamd-Queue-Id: 4KZ12B28vFz3M2k X-Spamd-Bar: --- 
Authentication-Results: mx1.freebsd.org; dkim=none; dmarc=pass (policy=reject) header.from=ramattack.net; spf=pass (mx1.freebsd.org: domain of egoitz@ramattack.net designates 195.16.151.183 as permitted sender) smtp.mailfrom=egoitz@ramattack.net X-Spamd-Result: default: False [-3.79 / 15.00]; RCVD_TLS_LAST(0.00)[]; RCVD_VIA_SMTP_AUTH(0.00)[]; XM_UA_NO_VERSION(0.01)[]; TO_DN_SOME(0.00)[]; R_SPF_ALLOW(-0.20)[+ip4:195.16.151.0/24]; NEURAL_HAM_LONG(-1.00)[-1.000]; MIME_GOOD(-0.10)[multipart/alternative,text/plain]; ARC_NA(0.00)[]; RCPT_COUNT_FIVE(0.00)[5]; TO_MATCH_ENVRCPT_SOME(0.00)[]; NEURAL_HAM_SHORT(-1.00)[-1.000]; DMARC_POLICY_ALLOW(-0.50)[ramattack.net,reject]; FROM_NO_DN(0.00)[]; NEURAL_HAM_MEDIUM(-1.00)[-1.000]; MLMMJ_DEST(0.00)[freebsd-fs,freebsd-hackers,freebsd-performance]; FROM_EQ_ENVFROM(0.00)[]; R_DKIM_NA(0.00)[]; MIME_TRACE(0.00)[0:+,1:+,2:~]; ASN(0.00)[asn:3262, ipnet:195.16.128.0/19, country:ES]; RCVD_COUNT_TWO(0.00)[2]; MID_RHS_MATCH_FROM(0.00)[] X-ThisMailContainsUnwantedMimeParts: N --=_ccf36dab3b44229808a0a46435736314 Content-Transfer-Encoding: 8bit Content-Type: text/plain; charset=UTF-8 Hi Stefan, An extremely interesting answer and email. Extremely thankful for all your deep explatanations...... They are like gold for us really.... I answer below and in blue bold for better distinction between your lines and mine ones... El 2022-04-06 23:49, Stefan Esser escribió: > ATENCION: Este correo se ha enviado desde fuera de la organización. No pinche en los enlaces ni abra los adjuntos a no ser que reconozca el remitente y sepa que el contenido es seguro. > > Am 06.04.22 um 18:34 schrieb egoitz@ramattack.net: > >> Hi Stefan! >> >> Thank you so much for your answer!!. I do answer below in green bold for instance... for a better distinction.... >> >> Very thankful for all your comments Stefan!!! :) :) :) >> >> Cheers!! > > Hi, > > glad to hear that it is useful information - I'll add comments below ... > > EXTREMELY HELPFUL INFORMATION REALLY! THANK YOU SO MUCH STEFFAN REALLY. VERY VERY THANKFUL FOR YOUR NICE HELP!. > > El 2022-04-06 17:43, Stefan Esser escribió: > > Am 06.04.22 um 16:36 schrieb egoitz@ramattack.net: Hi Rainer! > > Thank you so much for your help :) :) > > Well I assume they are in a datacenter and should not be a power outage.... > > About dataset size... yes... our ones are big... they can be 3-4 TB easily each > dataset..... > > We bought them, because as they are for mailboxes and mailboxes grow and > grow.... for having space for hosting them... > Which mailbox format (e.g. mbox, maildir, ...) do you use? > > I'M RUNNING CYRUS IMAP SO SORT OF MAILDIR... TOO MANY LITTLE FILES NORMALLY..... SOMETIMES DIRECTORIES WITH TONS OF LITTLE FILES.... Assuming that many mails are much smaller than the erase block size of the SSD, this may cause issues. (You may know the following ...) For example, if you have message sizes of 8 KB and an erase block size of 64 KB (just guessing), then 8 mails will be in an erase block. If half the mails are deleted, then the erase block will still occupy 64 KB, but only hold 32 KB of useful data (and the SSD will only be aware of this fact if TRIM has signaled which data is no longer relevant). The SSD will copy several partially filled erase blocks together in a smaller number of free blocks, which then are fully utilized. Later deletions will repeat this game, and your data will be copied multiple times until it has aged (and the user is less likely to delete further messages). 
This leads to "write amplification" - data is internally moved around and thus written multiple times. STEFAN!! YOU ARE NICE!! I THINK THIS COULD EXPLAIN ALL OUR PROBLEM. SO, WHY WE ARE HAVING THE MOST RANDOMNESS IN OUR PERFORMANCE DEGRADATION AND THAT DOES NOT NECESSARILY HAS TO MATCH WITH THE MOST IO PEAK HOURS... THAT I COULD CAUSE THAT PERFORMANCE DEGRADATION JUST BY DELETING A COUPLE OF HUGE (PERHAPS 200.000 MAILS) MAIL FOLDERS IN A MIDDLE TRAFFIC HOUR TIME!! THE PROBLEM IS THAT BY WHAT I KNOW, ERASE BLOCK SIZE OF AN SSD DISK IS SOMETHING FIXED IN THE DISK FIRMWARE. I DON'T REALLY KNOW IF PERHAPS IT COULD BE MODIFIED WITH SAMSUNG MAGICIAN OR THOSE KIND OF TOOL OF SAMSUNG.... ELSE I DON'T REALLY SEE THE MANNER OF IMPROVING IT... BECAUSE APART FROM THAT, YOU ARE DELETING A FILE IN RAIDZ-2 ARRAY... NO JUST IN A DISK... I ASSUME ALIGNING CHUNK SIZE, WITH RECORD SIZE AND WITH THE "SECRET" ERASE SIZE OF THE SSD, PERHAPS COULD BE SLIGHTLY COMPENSATED?. Larger mails are less of an issue since they span multiple erase blocks, which will be completely freed when such a message is deleted. I SEE I SEE STEFAN... Samsung has a lot of experience and generally good strategies to deal with such a situation, but SSDs specified for use in storage systems might be much better suited for that kind of usage profile. YES... AND THE DISKS FOR OUR PURPOSE... PERHAPS WEREN'T QVOS.... > We knew they had some speed issues, but those speed issues, we thought (as > Samsung explains in the QVO site) they started after exceeding the speeding > buffer this disks have. We though that meanwhile you didn't exceed it's > capacity (the capacity of the speeding buffer) no speed problem arises. Perhaps > we were wrong?. > These drives are meant for small loads in a typical PC use case, > i.e. some installations of software in the few GB range, else only > files of a few MB being written, perhaps an import of media files > that range from tens to a few hundred MB at a time, but less often > than once a day. > > WE MOVE, YOU KNOW... LOTS OF LITTLE FILES... AND LOT'S OF DIFFERENT CONCURRENT MODIFICATIONS BY 1500-2000 CONCURRENT IMAP CONNECTIONS WE HAVE... I do not expect the read load to be a problem (except possibly when the SSD is moving data from SLC to QLC blocks, but even then reads will get priority). But writes and trims might very well overwhelm the SSD, especially when its getting full. Keeping a part of the SSD unused (excluded from the partitions created) will lead to a large pool of unused blocks. This will reduce the write amplification - there are many free blocks in the "unpartitioned part" of the SSD, and thus there is less urgency to compact partially filled blocks. (E.g. if you include only 3/4 of the SSD capacity in a partition used for the ZPOOL, then 1/4 of each erase block could be free due to deletions/TRIM without any compactions required to hold all this data.) Keeping a significant percentage of the SSD unallocated is a good strategy to improve its performance and resilience. WELL, WE HAVE ALLOCATED ALL THE DISK SPACE... BUT NOT USED... JUST ALLOCATED.... YOU KNOW... WE DO A ZPOOL CREATE WITH THE WHOLE DISKS..... >> As the SSD fills, the space available for the single level write >> cache gets smaller >> >> THE SINGLE LEVEL WRITE CACHE IS THE CACHE THESE SSD DRIVERS HAVE, FOR COMPENSATING THE SPEED ISSUES THEY HAVE DUE TO USING QLC MEMORY?. DO YOU REFER TO THAT?. SORRY I DON'T UNDERSTAND WELL THIS PARAGRAPH. Yes, the SSD is specified to hold e.g. 1 TB at 4 bits per cell. 
The SLC cache has only 1 bit per cell, thus a 6 GB SLC cache needs as many cells as 24 GB of data in QLC mode. OK, TRUE.... YES.... A 100 GB SLC cache would reduce the capacity of a 1 TB SSD to 700 GB (600 GB in 150 tn QLC cells plus 100 GB in 100 tn SLC cells). AHH! YOU MEAN THAT SLC CAPACITY FOR SPEEDING UP THE QLC DISKS, IS OBTAINED FROM EACH SINGLE LAYER OF THE QLC?. Therefore, the fraction of the cells used as an SLC cache is reduced when it gets full (e.g. ~1 TB in ~250 tn QLC cells, plus 6 GB in 6 tn SLC cells). SORRY I DON'T GET THIS LAST SENTENCE... DON'T UNDERSTAND IT BECAUSE I DON'T REALLY KNOW THE MEANING OF TN... BUT I THINK I'M GETTING THE IDEA IF YOU SAY THAT EACH QLC LAYER, HAS IT'S OWN SLC CACHE OBTAINED FROM THE DISK SPACE AVAIABLE FOR EACH QLC LAYER.... And with less SLC cells available for short term storage of data the probability of data being copied to QLC cells before the irrelevant messages have been deleted is significantly increased. And that will again lead to many more blocks with "holes" (deleted messages) in them, which then need to be copied possibly multiple times to compact them. IF I CORRECT ABOVE, I THINK I GOT THE IDEA YES.... >> (on many SSDs, I have no numbers for this >> particular device), and thus the amount of data that can be >> written at single cell speed shrinks as the SSD gets full. >> >> I have just looked up the size of the SLC cache, it is specified >> to be 78 GB for the empty SSD, 6 GB when it is full (for the 2 TB >> version, smaller models will have a smaller SLC cache). >> >> ASSUMING YOU WERE TALKING ABOUT THE CACHE FOR COMPENSATING SPEED WE PREVIOUSLY COMMENTED, I SHOULD SAY THESE ARE THE 870 QVO BUT THE 8TB VERSION. SO THEY SHOULD HAVE THE BIGGEST CACHE FOR COMPENSATING THE SPEED ISSUES... I have looked up the data: the larger versions of the 870 QVO have the same SLC cache configuration as the 2 TB model, 6 GB minimum and up to 72 GB more if there are enough free blocks. OURS ONE IS THE 8TB MODEL SO I ASSUME IT COULD HAVE BIGGER LIMITS. THE DISKS ARE MOSTLY EMPTY, REALLY.... SO... FOR INSTANCE.... ZPOOL LIST NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT ROOT_DATASET 448G 2.29G 446G - - 1% 0% 1.00X ONLINE - MAIL_DATASET 58.2T 11.8T 46.4T - - 26% 20% 1.00X ONLINE - I SUPPOSE FRAGMENTATION AFFECTS TOO.... >> But after writing those few GB at a speed of some 500 MB/s (i.e. >> after 12 to 150 seconds), the drive will need several minutes to >> transfer those writes to the quad-level cells, and will operate >> at a fraction of the nominal performance during that time. >> (QLC writes max out at 80 MB/s for the 1 TB model, 160 MB/s for the >> 2 TB model.) >> >> WELL WE ARE IN THE 8TB MODEL. I THINK I HAVE UNDERSTOOD WHAT YOU WROTE IN PREVIOUS PARAGRAPH. YOU SAID THEY CAN BE FAST BUT NOT CONSTANTLY, BECAUSE LATER THEY HAVE TO WRITE ALL THAT TO THEIR PERPETUAL STORAGE FROM THE CACHE. AND THAT'S SLOW. AM I WRONG?. EVEN IN THE 8TB MODEL YOU THINK STEFAN?. The controller in the SSD supports a given number of channels (e.g 4), each of which can access a Flash chip independently of the others. Small SSDs often have less Flash chips than there are channels (and thus a lower throughput, especially for writes), but the larger models often have more chips than channels and thus the performance is capped. THIS IS TOTALLY LOGICAL. IF A QVO DISK WOULD OUTPERFORM BEST OR SIMILAR THAN AN INTEL WITHOUT CONSEQUENCES.... WHO WAS GOING TO BUY A EXPENSIVE INTEL ENTERPRISE?. 
In the case of the 870 QVO, the controller supports 8 channels, which allows it to write 160 MB/s into the QLC cells. The 1 TB model apparently has only 4 Flash chips and is thus limited to 80 MB/s in that situation, while the larger versions have 8, 16, or 32 chips. But due to the limited number of channels, the write rate is limited to 160 MB/s even for the 8 TB model. TOTALLY LOGICAL STEFAN... If you had 4 * 2 TB instead, the throughput would be 4 * 160 MB/s in this limit. >> THE MAIN PROBLEM WE ARE FACING IS THAT IN SOME PEAK MOMENTS, WHEN THE MACHINE SERVES CONNECTIONS FOR ALL THE INSTANCES IT HAS, AND ONLY AS SAID IN SOME PEAK MOMENTS... LIKE THE 09AM OR THE 11AM.... IT SEEMS THE MACHINE BECOMES SLOWER... AND LIKE IF THE DISKS WEREN'T ABLE TO SERVE ALL THEY HAVE TO SERVE.... IN THESE MOMENTS, NO BIG FILES ARE MOVED... BUT AS WE HAVE 1800-2000 CONCURRENT IMAP CONNECTIONS... NORMALLY THEY ARE DOING EACH ONE... LITTLE CHANGES IN THEIR MAILBOX. DO YOU THINK PERHAPS THIS DISKS THEN ARE NOT APPROPRIATE FOR THIS KIND OF USAGE?- I'd guess that the drives get into a state in which they have to recycle lots of partially free blocks (i.e. perform kind of a garbage collection) and then three kinds of operations are competing with each other: * reads (generally prioritized) * writes (filling the SLC cache up to its maximum size) * compactions of partially filled blocks (required to make free blocks available for re-use) Writes can only proceed if there are sufficient free blocks, which on a filled SSD with partially filled erase blocks means that operations of type 3. need to be performed with priority to not stall all writes. My assumption is that this is what you are observing under peak load. IT COULD BE ALTHOUGH THE DISKS ARE NOT FILLED.... THE POOL ARE AT 20 OR 30% OF CAPACITY AND FRAGMENTATION FROM 20%-30% (AS ZPOOL LIST STATES). >> And cheap SSDs often have no RAM cache (not checked, but I'd be >> surprised if the QVO had one) and thus cannot keep bookkeeping date >> in such a cache, further limiting the performance under load. >> >> THIS BROCHURE (HTTPS://SEMICONDUCTOR.SAMSUNG.COM/RESOURCES/BROCHURE/870_SERIES_BROCHURE.PDF AND THE DATASHEET HTTPS://SEMICONDUCTOR.SAMSUNG.COM/RESOURCES/DATA-SHEET/SAMSUNG_SSD_870_QVO_DATA_SHEET_REV1.1.PDF) SAIS IF I HAVE READ PROPERLY, THE 8TB DRIVE HAS 8GB OF RAM?. I ASSUME THAT IS WHAT THEY CALL THE TURBO WRITE CACHE?. No, the turbo write cache consists of the cells used in SLC mode (which can be any cells, not only cells in a specific area of the flash chip). I SEE I SEE.... The RAM is needed for fast lookup of the position of data for reads and of free blocks for writes. OUR ONES... SEEM TO HAVE 8GB LPDDR4 OF RAM.... AS DATASHEET STATES.... There is no simple relation between SSD "block number" (in the sense of a disk block on some track of a magnetic disk) and its storage location on the Flash chip. If an existing "data block" (what would be a sector on a hard disk drive) is overwritten, it is instead written at the end of an "open" erase block, and a pointer from that "block number" to the location on the chip is stored in an index. This index is written to Flash storage and could be read from it, but it is much faster to have a RAM with these pointers that can be accessed independently of the Flash chips. This RAM is required for high transaction rates (especially random reads), but it does not really help speed up writes. I SEE... I SEE.... I GOT IT... >> And the resilience (max. 
amount of data written over its lifetime) >> is also quite low - I hope those drives are used in some kind of >> RAID configuration. >> >> YEP WE USE RAIDZ-2 Makes sense ... But you know that you multiply the amount of data written due to the redundancy. If a single 8 KB block is written, for example, 3 * 8 KB will written if you take the 2 redundant copies into account. I SEE I SEE.... >> The 870 QVO is specified for 370 full capacity >> writes, i.e. 370 TB for the 1 TB model. That's still a few hundred >> GB a day - but only if the write amplification stays in a reasonable >> range ... >> >> WELL YES... 2880TB IN OUR CASE....NOT BAD.. ISN'T IT? I assume that 2880 TB is your total storage capacity? That's not too bad, in fact. ;-) NO... THE TOTAL NUMBER OF WRITES YOU CAN DO....BEFORE THE DISK "BREAKS".... LOL :) :) ... WE ARE HAVING STORAGES OF 50TB DUE TO 8 DISKS OF 8TB IN RAIDZ-2.... This would be 360 * 8 TB ... Even at 160 MB/s per 8 TB SSD this would allow for more than 50 GB/s of write throughput (if all writes were evenly distributed). Taking all odds into account, I'd guess that at least 10 GB/s can be continuously written (if supported by the CPUs and controllers). But this may not be true if the drive is simultaneously reading, trimming, and writing ... I SEE.... IT'S EXTREMELY MISLEADING YOU KNOW... BECAUSE... YOU CAN COPY FIVE MAILBOXES OF 50GB CONCURRENTLY FOR INSTANCE.... AND YOU FLOOD A GIGABIT INTERFACE COPYING (OBVIOUSLY BECAUSE DISKS CAN KEEP THAT THROUGHPUT)... BUT LATER.... YOU SEE... YOU ARE IN AN HOUR THAT YESTERDAY, AND EVEN 4 DAYS BEFORE YOU HAVE NOT HAD ANY ISSUES... AND THAT DAY... YOU SEE THE COMMENTED ISSUE... EVEN NOT BEING EXACTLY AT A PEAK HOUR (PERHAPS IS TWO HOURS LATER THE PEAK HOUR EVEN)... OR... BUT I WASN'T NOTICING ABOUT ALL THINGS YOU SAY IN THIS EMAIL.... I have seen advice to not use compression in a high load scenario in some other reply. I tend to disagree: Since you seem to be limited when the SLC cache is exhausted, you should get better performance if you compress your data. I have found that zstd-2 works well for me (giving a significant overall reduction of size at reasonable additional CPU load). Since ZFS allows to switch compressions algorithms at any time, you can experiment with different algorithms and levels. I SEE... YOU SAY COMPRESSION SHOULD BE ENABLED.... THE MAIN REASON BECAUSE WE HAVE NOT ENABLED IT YET, IS FOR KEEPING THE SYSTEM THE MOST NEAR POSSIBLE TO CONFIG DEFAULTS... YOU KNOW... FOR LATER BEING ABLE TO ASK IN THIS MAILING LISTS IF WE HAVE AN ISSUE... BECAUSE YOU KNOW... IT WOULD BE FAR MORE EASIER TO ASK ABOUT SOMETHING STRANGE YOU ARE SEEING WHEN THAT STRANGE THING IS NEAR TO A WELL TESTED CONFIG, LIKE THE CONFIG BY DEFAULT.... BUT NOW YOU SAY STEFAN... IF YOU SWITCH BETWEEN COMPRESSION ALGORITHMS YOU WILL END UP WITH A MIX OF DIFFERENT FILES COMPRESSED IN A DIFFERENT MANNER... THAT IS NOT A BIT DISASTER LATER?. DOESN'T AFFECT PERFORMANCE IN SOME MANNER?. One advantage of ZFS compression is that it applies to the ARC, too. And a compression factor of 2 should easily be achieved when storing mail (not for .docx, .pdf, .jpg files though). Having more data in the ARC will reduce the read pressure on the SSDs and will give them more cycles for garbage collections (which are performed in the background and required to always have a sufficient reserve of free flash blocks for writes). WE WOULD USE I ASSUME THE LZ4... WHICH IS THE LESS "EXPENSIVE" COMPRESSION ALGORITHM FOR THE CPU... 
AND I ASSUME TOO FOR AVOIDING DELAY ACCESSING DATA... DO YOU RECOMMEND ANOTHER ONE?. DO YOU ALWAYS RECOMMEND COMPRESSION THEN?. I'd give it a try - and if it reduces your storage requirements by 10% only, then keep 10% of each SSD unused (not assigned to any partition). That will greatly improve the resilience of your SSDs, reduce the write-amplification, will allow the SLC cache to stay at its large value, and may make a large difference to the effective performance under high load. BUT WHEN YOU ENABLE COMPRESSION... ONLY GETS COMPRESSED THE NEW DATA MODIFIED OR ENTERED. AM I WRONG?. BY THE WAY, WE HAVE MORE OR LESS 1/4 OF EACH DISK USED (12 TB ALLOCATED IN A POLL STATED BY ZPOOL LIST, DIVIDED BETWEEN 8 DISKS OF 8TB...)... DO YOU THINK WE COULD BE SUFFERING ON WRITE AMPLIFICATION AND SO... HAVING A SO LITTLE DISK SPACE USED IN EACH DISK?. Regards, STefan HEY MATE, YOUR MAIL IS INCREDIBLE. IT HAS HELPED AS A LOT. CAN WE INVITE YOU A CUP OF COFFEE OR A BEER THROUGH PAYPAL OR SIMILAR?. CAN I HELP YOU IN SOME MANNER?. CHEERS! --=_ccf36dab3b44229808a0a46435736314 Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset=UTF-8
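To make the over-provisioning advice above concrete, here is a rough sketch of building the pool on partitions that leave about 10% of each SSD unassigned, instead of handing ZFS the whole disks. All device names, labels and sizes are examples, and this layout can only be applied when (re)creating the pool:

  # Partition one 8 TB SSD and leave roughly 10% of the raw capacity unused.
  $ gpart create -s gpt da0
  $ gpart add -t freebsd-zfs -a 1m -s 6700G -l qvo0 da0
  # Repeat for the remaining seven disks (da1..da7, labels qvo1..qvo7),
  # then build the raidz2 vdev from the labels instead of the raw disks:
  $ zpool create mail raidz2 gpt/qvo0 gpt/qvo1 gpt/qvo2 gpt/qvo3 gpt/qvo4 gpt/qvo5 gpt/qvo6 gpt/qvo7
  # On OpenZFS 2.0 (FreeBSD 13) automatic TRIM keeps freed blocks known to the controller:
  $ zpool set autotrim=on mail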


From nobody Thu Apr 7 13:39:58 2022
Date: Thu, 07 Apr 2022 15:39:58 +0200
From: egoitz@ramattack.net
To: Stefan Esser
Cc: mike tancsa, Bob Friesenhahn, freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org, Freebsd performance
Subject: Re: Desperate with 870 QVO and ZFS

Hi Stefan!!,

Thank you so much. I can't be more thankful.

I answer below, inline, so my answers are easier to see....

On 2022-04-06 23:59, Stefan Esser wrote:

> I have got much better compression at same or less load by use of zstd-2
> compared to lz4.

I assume people would normally use lzop due to its known light CPU load... I take note of zstd-2 too....

> Perhaps not typical, since this is a dovecot mdbox formatted mail pool
> holding mostly plain text messages without large attachments:
>
> $ df /var/mdbox
> Filesystem       1K-blocks     Used      Avail Capacity  Mounted on
> system/var/mdbox 7234048944  9170888 7224878056     0%   /var/mdbox
>
> $ zfs get compression,compressratio,used,logicalused system/var/mdbox
> NAME              PROPERTY       VALUE    SOURCE
> system/var/mdbox  compression    zstd-2   inherited from system/var
> system/var/mdbox  compressratio  2.29x    -
> system/var/mdbox  used           8.76G    -
> system/var/mdbox  logicalused    20.0G    -
>
> Regards, STefan

Nice to know Stefan, nice to know!

Cheers,
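
A quick way to reproduce that kind of comparison on one's own mail data is to copy a sample into two scratch datasets and compare the resulting ratios; a sketch with assumed dataset names and an assumed Cyrus spool path:

$ zfs create -o compression=lz4    mail_dataset/test-lz4
$ zfs create -o compression=zstd-2 mail_dataset/test-zstd2
$ cp -a /var/spool/imap/sample/. /mail_dataset/test-lz4/
$ cp -a /var/spool/imap/sample/. /mail_dataset/test-zstd2/
$ zfs get compressratio,used,logicalused mail_dataset/test-lz4 mail_dataset/test-zstd2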

From nobody Thu Apr 7 13:43:29 2022
Date: Thu, 7 Apr 2022 09:43:29 -0400
From: mike tancsa
To: egoitz@ramattack.net
Cc: freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org, Freebsd performance
Subject: Re: Desperate with 870 QVO and ZFS

On 4/7/2022 7:25 AM, mike tancsa wrote:
> On 4/7/2022 4:59 AM, egoitz@ramattack.net wrote:
>>
>> Hi Mike!
>>
>> Thanks a lot for your comment. I see. As said before, we didn't really
>> enable compression because we just keep the config as FreeBSD leaves it by
>> default. Apart from that, we have tons of disk space and wanted to avoid
>> the load of compressing/decompressing... The main reason was that it is
>> not enabled by default and we had not seen a real reason for it... it was
>> no more than that. I appreciate your comments, really :)
>
> Think of the extreme case where you do something like
>
> dd if=/dev/zero of=/tank/junk.bin bs=1m count=10000
>
> as this is a ~10G file that takes just a few hundred bytes of write IO
> on a compressed system. Obviously, as the compression ratio goes down in
> the real world the benefits become smaller. Where that point of diminishing
> returns is, I am not sure. But something to keep in mind.

You might also want to have a look at this article which I found quite helpful:

https://klarasystems.com/articles/openzfs1-understanding-transparent-compression/

    ---Mike

From nobody Thu Apr 7 13:53:06 2022
Date: Thu, 07 Apr 2022 15:53:06 +0200
From: egoitz@ramattack.net
To: Stefan Esser
Cc: Jan Bramkamp, performance@freebsd.org, freebsd-fs@freebsd.org, FreeBSD Hackers
Subject: Re: Desperate with 870 QVO and ZFS

Hi Stefan,

Thanks a lot again, mate. Answering below, inline...

On 2022-04-07 09:59, Stefan Esser wrote:

> I have not compared dovecot's zlib compression with zstd-2 on the file system,
> but since I use the latter on all my ZFS file systems (except those that
> exclusively hold compressed files and media), I'm using it for Dovecot mdbox
> files, too. I get a compression ratio of 2.29 with ZFS zstd-2, maybe I should
> copy the files over into a zlib compressed mdbox for comparison ...

We are running Cyrus here... although that check sounds interesting...

> One large advantage of the mdbox format in the context of the mail server
> set-up at the start of this thread is that deletions are only registered in
> an index file (while mbox needs a rewrite of potentially large parts of the
> mail folder, and mdir immediately deletes files (TRIM) and updates inodes and
> directory entries, causing multiple writes per deleted message).

I see... really... I love Cyrus... its replication is extremely reliable...

For quite some time Dovecot didn't have replication... and we have had developments of our own done for Cyrus IMAP for some time now...

But good to know about other software's advantages...

> With mdbox you can delay all "expensive" file system operations to the
> point of least load each day, for example. Such a compression run is also
> well suited for SSDs, since it does not perform random updates that punch
> holes in a large number of erase blocks (which then will need to be garbage
> collected, causing write amplification to put further load and stress on
> the SSD).

We don't delete mail during day hours. We use a feature of Cyrus called delayed expunge. Deleted email is removed from disk at 04:00 (until that moment it is just tagged as deleted in a Cyrus database). The exception happens when you rename a folder or delete an entire folder. If you delete an entire folder, I think it gets copied to a DELETED/..whatever.. folder and then yes... it copies and later deletes...

Apart from that, Cyrus does a lot of database checkpointing, causing databases to be copied to a newly created one and the old one to be deleted. These are the only removals we do during day time. The rest is done from 04:00 to 05:00, when there's no load.

Cheers!!!
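
Since almost all deletions are already concentrated in that 04:00-05:00 window, one option (a sketch only; the pool name is assumed, and it requires an OpenZFS release with TRIM support) is to kick off a manual TRIM right after it from /etc/crontab, so the freed blocks are reported to the SSDs while load is still low:

# run a TRIM pass every day at 05:10, after the delayed expunge has finished
10  5  *  *  *  root  /sbin/zpool trim mail_dataset
# progress can be checked with: zpool status -t mail_dataset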

From nobody Thu Apr 7 13:57:33 2022
Date: Thu, 07 Apr 2022 15:57:33 +0200
From: egoitz@ramattack.net
To: Stefan Esser
Cc: freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org, freebsd-performance@freebsd.org, Rainer Duffner
Subject: Re: Desperate with 870 QVO and ZFS

Hi!!

Thanks ;) really man :)

Answering below, inline...

On 2022-04-07 12:05, Stefan Esser wrote:

> I just noticed that this is not the extreme total size of a ZFS pool
> (should have noticed this while answering late at night ...)
>
> And no, a specified life-time of 2880 TB written is not much, it is
> at the absolute lower end of currently available SSDs at 360 TB per
> 1 TB of capacity.

Yep, that's it...

> This is equivalent to 360 total capacity writes, but given the high
> amount of write amplification that can be assumed to occur in your
> use case, I'd heavily over-provision a system with such SSDs ...
> (or rather: strictly avoid them in a non-consumer setting).

It's slightly late for over-provisioning... you know... we have done the zpool create with the whole disks, not just a slice....

We can try to stay somewhat below the 80% capacity limit, but... repartitioning is not possible now... at least for this group of servers...

Cheers!
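
If repartitioning is no longer an option, one rough workaround (a sketch; pool name and numbers are assumptions, and the quota value is only approximate) is to cap usage with a quota on the top-level dataset so the pool can never fill much beyond ~80%:

$ zfs set quota=35T mail_dataset     # roughly 80% of the usable (post-raidz2) capacity
$ zfs get quota,available mail_dataset

This does not give the drives a dedicated spare area the way unpartitioned space would, but together with TRIM it at least keeps a pool-wide reserve of free (and trimmed) blocks.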

From nobody Thu Apr 7 14:01:04 2022
Date: Thu, 07 Apr 2022 16:01:04 +0200
From: egoitz@ramattack.net
To: mike tancsa
Cc: freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org, Freebsd performance
Subject: Re: Desperate with 870 QVO and ZFS

Hi Mike!

Thanks a lot for your answer :) :) and your time :) :)

Answering below, inline...

On 2022-04-07 13:25, mike tancsa wrote:

> On 4/7/2022 4:59 AM, egoitz@ramattack.net wrote:
>
> Hi,
>
> With respect to compression, I think there is a sweet spot somewhere, where
> compression makes things faster if your disk IO is the limiting factor and
> you have spare CPU capacity. I have a separate 13.x zfs server with zstd
> enabled and I get compression ratios of 15:1 as it stores a lot of giant
> JSON txt files.

zstd or zstd-2, as Stefan stated in a previous mail? I assume they are not the same... are they?

> Think of the extreme case where you do something like
>
> dd if=/dev/zero of=/tank/junk.bin bs=1m count=10000
>
> as this is a ~10G file that takes just a few hundred bytes of write IO on a
> compressed system. Obviously, as the compression ratio goes down in the real
> world the benefits become smaller. Where that point of diminishing returns
> is, I am not sure. But something to keep in mind.

Totally true, and I totally agree, Mike!!

Cheers!!
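
As a quick way to see that effect on disk, the file's logical size can be compared with the space actually allocated after such a test; a sketch, assuming a dataset mounted at /tank with compression already enabled:

$ dd if=/dev/zero of=/tank/junk.bin bs=1m count=10000
$ ls -lh /tank/junk.bin     # logical size: roughly 10 GB
$ du -h  /tank/junk.bin     # space actually allocated: next to nothing on a compressed dataset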

From nobody Thu Apr 7 14:02:16 2022
Date: Thu, 07 Apr 2022 16:02:16 +0200
From: egoitz@ramattack.net
To: mike tancsa
Cc: freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org, Freebsd performance
Subject: Re: Desperate with 870 QVO and ZFS

Sure!!! good to know mate!!!

Thanks Mike!!

On 2022-04-07 15:43, mike tancsa wrote:

> ATTENTION!!! This email was sent from outside the organization. Do not click
> on links or open attachments unless you recognize the sender and know the
> content is safe.
>
> On 4/7/2022 7:25 AM, mike tancsa wrote:
>> On 4/7/2022 4:59 AM, egoitz@ramattack.net wrote:
>>> Hi Mike!
>>>
>>> Thanks a lot for your comment. I see. As said before, we didn't really
>>> enable compression because we just keep the config as FreeBSD leaves it by
>>> default. Apart from that, we have tons of disk space and wanted to avoid
>>> the load of compressing/decompressing... The main reason was that it is
>>> not enabled by default and we had not seen a real reason for it... it was
>>> no more than that. I appreciate your comments, really :)
>>
>> Think of the extreme case where you do something like
>>
>> dd if=/dev/zero of=/tank/junk.bin bs=1m count=10000
>>
>> as this is a ~10G file that takes just a few hundred bytes of write IO on a
>> compressed system. Obviously, as the compression ratio goes down in the real
>> world the benefits become smaller. Where that point of diminishing returns
>> is, I am not sure. But something to keep in mind.
>
> You might also want to have a look at this article which I found quite helpful:
>
> https://klarasystems.com/articles/openzfs1-understanding-transparent-compression/
>
>     ---Mike

From nobody Fri Apr 8 11:14:24 2022
Date: Fri, 8 Apr 2022 13:14:24 +0200
From: Stefan Esser
To: egoitz@ramattack.net
Cc: freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org, freebsd-performance@freebsd.org, Rainer Duffner
Subject: Re: Desperate with 870 QVO and ZFS

On 07.04.22 at 14:30, egoitz@ramattack.net wrote:
> On 2022-04-06 23:49, Stefan Esser wrote:
>>>
>>> On 2022-04-06 17:43, Stefan Esser wrote:
>>>
>>> On 06.04.22 at 16:36, egoitz@ramattack.net wrote:
>>>
>>> Hi Rainer!
>>>
>>> Thank you so much for your help :) :)
>>>
>>> Well, I assume they are in a datacenter and there should not be a power
>>> outage....
>>>
>>> About dataset size... yes... ours are big... they can easily be 3-4 TB
>>> each dataset.....
>>>
>>> We bought them because they are for mailboxes, and mailboxes grow and
>>> grow.... so we need space for hosting them...
>>>
>>> Which mailbox format (e.g. mbox, maildir, ...) do you use?
>>>
>>> *I'm running Cyrus imap, so sort of Maildir... too many little files
>>> normally..... Sometimes directories with tons of little files....*
>>>
>> Assuming that many mails are much smaller than the erase block size of the
>> SSD, this may cause issues. (You may know the following ...)
>>
>> For example, if you have message sizes of 8 KB and an erase block size of
>> 64 KB (just guessing), then 8 mails will be in an erase block. If half the
>> mails are deleted, then the erase block will still occupy 64 KB, but only
>> hold 32 KB of useful data (and the SSD will only be aware of this fact if
>> TRIM has signaled which data is no longer relevant). The SSD will copy
>> several partially filled erase blocks together into a smaller number of
>> free blocks, which then are fully utilized. Later deletions will repeat
>> this game, and your data will be copied multiple times until it has aged
>> (and the user is less likely to delete further messages). This leads to
>> "write amplification" - data is internally moved around and thus written
>> multiple times.
>>
>> *Stefan!! you are nice!! I think this could explain our whole problem. It
>> would explain why we see the most randomness in our performance
>> degradation, not necessarily matching the highest io peak hours... It means
>> I could cause that performance degradation just by deleting a couple of
>> huge (perhaps 200.000 mails) mail folders in a middle-traffic hour!!*

Yes, if deleting large amounts of data triggers performance issues (and the
disk does not have a deficient TRIM implementation), then the issue is likely
to be due to internal garbage collections colliding with other operations.

>> *The problem is that, as far as I know, the erase block size of an SSD disk
>> is something fixed in the disk firmware. I don't really know if perhaps it
>> could be modified with Samsung Magician or that kind of Samsung tool....
>> otherwise I don't really see a way of improving it... because apart from
>> that, you are deleting a file in a raidz-2 array... not just on one disk...
>> I assume that aligning chunk size with record size and with the "secret"
>> erase size of the SSD could perhaps compensate slightly?*

The erase block size is a fixed hardware feature of each flash chip. There is
a block size for writes (e.g. 8 KB) and many such blocks are combined in one
erase block (of e.g. 64 KB, probably larger in today's SSDs); they can only be
returned to the free block pool all together. And if some of these writable
blocks hold live data, they must be preserved by collecting them in newly
allocated free blocks.

An example of what might happen, showing a simplified layout of files 1, 2, 3
(with writable blocks 1a, 1b, ..., 2a, 2b, ..., and "--" for stale data of
deleted files, ".." for erased/writable flash blocks) in an SSD might be:

  erase block 1: |1a|1b|--|--|2a|--|--|3a|
  erase block 2: |--|--|--|2b|--|--|--|1c|
  erase block 3: |2c|1d|3b|3c|--|--|--|--|
  erase block 4: |..|..|..|..|..|..|..|..|

This is just a random example of how data could be laid out on the physical
storage array. It is assumed that the 3 erase blocks once were completely
occupied. In this example, 10 of 32 writable blocks are occupied, and only one
free erase block exists.

This situation must not persist, since the SSD needs more empty erase blocks.
10/32 of the capacity is used for data, but 3/4 of the blocks are occupied and
not immediately available for new data.

The garbage collection might combine erase blocks 1 and 3 into a currently
free one, e.g. erase block 4:

  erase block 1: |..|..|..|..|..|..|..|..|
  erase block 2: |--|--|--|2b|--|--|--|1c|
  erase block 3: |..|..|..|..|..|..|..|..|
  erase block 4: |1a|1b|2a|3a|2c|1d|3b|3c|

Now only 2/4 of the capacity is not available for new data (which is still a
lot more than 10/32, but better than before).

Now assume file 2 is deleted:

  erase block 1: |..|..|..|..|..|..|..|..|
  erase block 2: |--|--|--|--|--|--|--|1c|
  erase block 3: |..|..|..|..|..|..|..|..|
  erase block 4: |1a|1b|--|3a|--|1d|3b|3c|

There is now a new sparsely used erase block 4, and it will soon need to be
garbage collected, too - in fact it could be combined with the live data from
erase block 2, but this may be delayed until there is demand for more erased
blocks (since e.g. file 1 or 3 might also have been deleted by then).

The garbage collection does not know which data blocks belong to which file,
and therefore it cannot collect the data belonging to a file into a single
erase block. Blocks are allocated as data comes in (as long as enough SLC
cells are available in this area, else directly in QLC cells). Your many
parallel updates will cause fractions of each larger file to be spread out
over many erase blocks.
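
As an aside, whether those TRIM notifications reach the drives at all can be checked, and enabled if needed; a minimal sketch, assuming the pool name mail_dataset and an OpenZFS release with TRIM support:

$ zpool get autotrim mail_dataset
$ zpool set autotrim=on mail_dataset      # report freed blocks continuously
$ zpool trim mail_dataset                 # or run a one-off manual TRIM pass
$ zpool status -t mail_dataset            # shows the per-vdev trim state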
As you can see, a single file that is deleted may affect many erase blocks, and
you have to take redundancy into consideration, which will multiply the effect
by a factor of up to 3 for small files (one ZFS allocation block). And
remember: deleting a message in mdir format will free the data blocks, but will
also remove the directory entry, causing additional meta-data writes (again
multiplied by the raid redundancy).

A consumer SSD would normally see only very few parallel writes, and sequential
writes of full files will have a high chance to put the data of each file
contiguously in the minimum number of erase blocks, allowing multiple complete
erase blocks to be freed when such a file is deleted and thus obviating the
need for many garbage collection copies (which occur if data from several
independent files is in one erase block).

Actual SSDs have many more cells than advertised. Some 10% to 20% may be kept
as a reserve for aging blocks that e.g. may have failed a kind of
"read-after-write test" (implemented in the write function, which adds charges
to the cells until they return the correct read-outs).

BTW: Having an ashift value that is lower than the internal write block size
may also lead to higher write amplification values, but a large ashift may
lead to more wasted capacity, which may become an issue if typical file
lengths are much smaller than the allocation granularity that results from the
ashift value.

>> Larger mails are less of an issue since they span multiple erase blocks,
>> which will be completely freed when such a message is deleted.
>>
>> *I see I see Stefan...*
>>
>> Samsung has a lot of experience and generally good strategies to deal with
>> such a situation, but SSDs specified for use in storage systems might be
>> much better suited for that kind of usage profile.
>>
>> *Yes... and perhaps QVOs weren't the right disks for our purpose....*

You should have got (much more expensive) server grade SSDs, IMHO.

But even 4 * 2 TB QVO (or better EVO) drives per each 8 TB QVO drive would
result in better performance (but would need a lot of extra SATA ports).

In fact, I'm not sure whether rotating media plus a reasonable L2ARC
consisting of a fast M.2 SSD, plus a mirror of small SSDs for a LOG device,
would not be a better match for your use case. Reading the L2ARC would be very
fast, writes would be purely sequential and relatively slow, you could choose
a suitable L2ARC strategy (caching of file data vs. meta data), and the LOG
device would support the fast fsync() operations required for reliable mail
systems (which confirm data is on stable storage before acknowledging the
reception to the sender).

>>> We knew they had some speed issues, but we thought (as Samsung explains on
>>> the QVO site) those issues only started after exceeding the speeding
>>> buffer these disks have. We thought that as long as you didn't exceed its
>>> capacity (the capacity of the speeding buffer) no speed problem would
>>> arise. Perhaps we were wrong?
>>>
>>> These drives are meant for small loads in a typical PC use case, i.e. some
>>> installations of software in the few GB range, else only files of a few MB
>>> being written, perhaps an import of media files that range from tens to a
>>> few hundred MB at a time, but less often than once a day.
>>>
>>> *We move, you know... lots of little files... and lots of different
>>> concurrent modifications by the 1500-2000 concurrent imap connections we
>>> have...*
>>
>> I do not expect the read load to be a problem (except possibly when the SSD
>> is moving data from SLC to QLC blocks, but even then reads will get
>> priority). But writes and trims might very well overwhelm the SSD,
>> especially when it is getting full. Keeping a part of the SSD unused
>> (excluded from the partitions created) will lead to a large pool of unused
>> blocks. This will reduce the write amplification - there are many free
>> blocks in the "unpartitioned part" of the SSD, and thus there is less
>> urgency to compact partially filled blocks. (E.g. if you include only 3/4
>> of the SSD capacity in a partition used for the ZPOOL, then 1/4 of each
>> erase block could be free due to deletions/TRIM without any compactions
>> required to hold all this data.)
>>
>> Keeping a significant percentage of the SSD unallocated is a good strategy
>> to improve its performance and resilience.
>>
>> *Well, we have allocated all the disk space... but not used... just
>> allocated.... you know... we do a zpool create with the whole disks.....*

I think the only chance for a solution that does not require new hardware is
to make sure only some 80% of the SSDs are used (i.e. allocate only 80% for
ZFS, leave 20% unallocated). This will significantly reduce the rate of
garbage collections and thus reduce the load they cause.

I'd use a fast compression algorithm (zstd - choose a level that does not
overwhelm the CPU; there are benchmark results for ZFS with zstd, and I found
zstd-2 to be best for my use case). This will more than make up for the space
you left unallocated on the SSDs.

A different mail box format might help, too - I'm happy with dovecot's mdbox
format, which is as fast but much more efficient than mdir.

>>> As the SSD fills, the space available for the single level write cache
>>> gets smaller
>>>
>>> *The single level write cache is the cache these ssd drives have for
>>> compensating the speed issues they have due to using qlc memory? Do you
>>> refer to that? Sorry, I don't understand this paragraph well.*
>>
>> Yes, the SSD is specified to hold e.g. 1 TB at 4 bits per cell. The SLC
>> cache has only 1 bit per cell, thus a 6 GB SLC cache needs as many cells as
>> 24 GB of data in QLC mode.
>>
>> *Ok, true.... yes....*
>>
>> A 100 GB SLC cache would reduce the capacity of a 1 TB SSD to 700 GB
>> (600 GB in 150 tn QLC cells plus 100 GB in 100 tn SLC cells).
>>
>> *Ahh! you mean that the SLC capacity for speeding up the QLC disks is
>> obtained from each single layer of the QLC?*

There are no specific SLC cells. A fraction of the QLC-capable cells is
written with only 1 instead of 4 bits. This is a much simpler process, since
only 2 charge levels per cell are used, while QLC uses 16 charge levels, and
you can only add charge (must not overshoot), therefore only small increments
are added until the correct value can be read out.

But since SLC cells take away specified capacity (which is calculated assuming
all cells hold 4 bits each, not only 1 bit), their number is limited and
shrinks as demand for QLC cells grows.

The advantage of the SLC cache is fast writes, but also that data in it may
have become stale (trimmed) and thus will never be copied over into a QLC
block. But as the SSD fills and the size of the SLC cache shrinks, this
capability will be mostly lost, and lots of very short lived data is stored in
QLC cells, which will quickly become partially stale and thus need compaction
as explained above.

>> Therefore, the fraction of the cells used as an SLC cache is reduced when
>> it gets full (e.g. ~1 TB in ~250 tn QLC cells, plus 6 GB in 6 tn SLC cells).
>>
>> *Sorry, I don't get this last sentence... I don't understand it because I
>> don't really know the meaning of tn...*
>>
>> *but I think I'm getting the idea, if you say that each QLC layer has its
>> own SLC cache obtained from the disk space available for each QLC layer....*
>>
>> And with less SLC cells available for short term storage of data, the
>> probability of data being copied to QLC cells before the irrelevant
>> messages have been deleted is significantly increased. And that will again
>> lead to many more blocks with "holes" (deleted messages) in them, which
>> then need to be copied possibly multiple times to compact them.
>>
>> *If I am correct above, I think I got the idea, yes....*
>>
>>> (on many SSDs, I have no numbers for this particular device), and thus the
>>> amount of data that can be written at single cell speed shrinks as the SSD
>>> gets full.
>>>
>>> I have just looked up the size of the SLC cache, it is specified to be
>>> 78 GB for the empty SSD, 6 GB when it is full (for the 2 TB version,
>>> smaller models will have a smaller SLC cache).
>>>
>>> *Assuming you were talking about the cache for compensating speed that we
>>> commented on previously, I should say these are the 870 QVO but the 8TB
>>> version. So they should have the biggest cache for compensating the speed
>>> issues...*
>>
>> I have looked up the data: the larger versions of the 870 QVO have the same
>> SLC cache configuration as the 2 TB model, 6 GB minimum and up to 72 GB
>> more if there are enough free blocks.
>>
>> *Ours is the 8TB model, so I assumed it could have bigger limits. The disks
>> are mostly empty, really.... so... for instance....*
>>
>> *zpool list*
>> *NAME          SIZE   ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT*
>> *root_dataset   448G  2.29G   446G        -         -     1%     0%  1.00x  ONLINE  -*
>> *mail_dataset  58.2T  11.8T  46.4T        -         -    26%    20%  1.00x  ONLINE  -*

Ok, it seems you have got 8 * 8 TB in a raidz2 configuration.

Only 20% of the mail dataset is in use; the situation will become much worse
when the pool fills up!

>> *I suppose fragmentation affects things too....*

On magnetic media fragmentation means that a file is spread out over the disk
in a non-optimal way, causing access latencies due to seeks and rotational
delay. That kind of fragmentation is not really relevant for SSDs, which allow
for fast random access to the cells. And the FRAG value shown by the "zpool
list" command is not about fragmentation of files at all, it is about the
structure of free space. Anyway, it is less relevant for SSDs than for classic
hard disk drives.
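
A rough sketch of the hybrid (HDD + L2ARC + SLOG) setup suggested above; every device name below is only a placeholder, and secondarycache=metadata is just one possible caching strategy:

$ zpool create mailpool raidz2 da0 da1 da2 da3 da4 da5 da6 da7
$ zpool add mailpool cache nvd0p1               # L2ARC on a fast M.2/NVMe partition
$ zpool add mailpool log mirror ada8p1 ada9p1   # mirrored SLOG for fast fsync()
$ zfs set secondarycache=metadata mailpool      # or "all" to also cache file data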
>>> But after writing those few GB at a speed of some 500 MB/s (i.e. after 12
>>> to 150 seconds), the drive will need several minutes to transfer those
>>> writes to the quad-level cells, and will operate at a fraction of the
>>> nominal performance during that time. (QLC writes max out at 80 MB/s for
>>> the 1 TB model, 160 MB/s for the 2 TB model.)
>>>
>>> *Well, we have the 8TB model. I think I have understood what you wrote in
>>> the previous paragraph. You said they can be fast, but not constantly,
>>> because later they have to write all that from the cache to their
>>> permanent storage, and that is slow. Am I wrong? Do you think that holds
>>> even for the 8TB model, Stefan?*
>>
>> The controller in the SSD supports a given number of channels (e.g. 4),
>> each of which can access a Flash chip independently of the others. Small
>> SSDs often have fewer Flash chips than there are channels (and thus a lower
>> throughput, especially for writes), but the larger models often have more
>> chips than channels and thus the performance is capped.
>>
>> *This is totally logical. If a QVO disk performed as well as or better than
>> an Intel without any drawbacks.... who would buy an expensive enterprise
>> Intel?*

The QVO is bandwidth limited due to the SATA data rate of 6 Gbit/s anyway, and
it is optimized for reads (which are not significantly slower than offered by
the TLC models). This is a viable concept for a consumer PC, but not for a
server.

>> In the case of the 870 QVO, the controller supports 8 channels, which
>> allows it to write 160 MB/s into the QLC cells. The 1 TB model apparently
>> has only 4 Flash chips and is thus limited to 80 MB/s in that situation,
>> while the larger versions have 8, 16, or 32 chips. But due to the limited
>> number of channels, the write rate is limited to 160 MB/s even for the 8 TB
>> model.
>>
>> *Totally logical, Stefan...*
>>
>> If you had 4 * 2 TB instead, the throughput would be 4 * 160 MB/s in this
>> limit.
>>
>>> *The main problem we are facing is that in some peak moments, when the
>>> machine serves connections for all the instances it has, and only in some
>>> peak moments... like 09am or 11am.... the machine seems to become
>>> slower... as if the disks were not able to serve all they have to
>>> serve.... In these moments no big files are moved... but as we have
>>> 1800-2000 concurrent imap connections... each of them is normally making
>>> little changes in its mailbox. Do you think these disks are simply not
>>> appropriate for this kind of usage?*
>>
>> I'd guess that the drives get into a state in which they have to recycle
>> lots of partially free blocks (i.e. perform kind of a garbage collection)
>> and then three kinds of operations are competing with each other:
>>
>> 1. reads (generally prioritized)
>> 2. writes (filling the SLC cache up to its maximum size)
>> 3. compactions of partially filled blocks (required to make free blocks
>>    available for re-use)
>>
>> Writes can only proceed if there are sufficient free blocks, which on a
>> filled SSD with partially filled erase blocks means that operations of
>> type 3 need to be performed with priority to not stall all writes.
>>
>> My assumption is that this is what you are observing under peak load.
>>
>> *It could be, although the disks are not filled.... the pools are at 20 or
>> 30% of capacity and fragmentation is 20%-30% (as zpool list states).*

Yes, and that means that your issues will become much more critical over time
when the free space shrinks and garbage collections will be required at an
even faster rate, with the SLC cache becoming less and less effective at
weeding out short lived files as an additional factor that will increase write
amplification.

>>> And cheap SSDs often have no RAM cache (not checked, but I'd be surprised
>>> if the QVO had one) and thus cannot keep bookkeeping data in such a cache,
>>> further limiting the performance under load.
>>>
>>> *This brochure
>>> (https://semiconductor.samsung.com/resources/brochure/870_Series_Brochure.pdf
>>> and the datasheet
>>> https://semiconductor.samsung.com/resources/data-sheet/Samsung_SSD_870_QVO_Data_Sheet_Rev1.1.pdf)
>>> says, if I have read it properly, that the 8TB drive has 8GB of ram? I
>>> assume that is what they call the turbo write cache?*
>>
>> No, the turbo write cache consists of the cells used in SLC mode (which can
>> be any cells, not only cells in a specific area of the flash chip).
>>
>> *I see I see....*
>>
>> The RAM is needed for fast lookup of the position of data for reads and of
>> free blocks for writes.
>>
>> *Ours... seem to have 8GB of LPDDR4 ram.... as the datasheet states.....*

Yes, and it makes sense that the RAM size is proportional to the capacity
since a few bytes are required per addressable data block.

If the block size was 8 KB, the RAM could hold 8 bytes (e.g. a pointer and
some status flags) for each logically addressable block. But there is no
information about the actual internal structure of the QVO that I know of.

[...]

>> *I see.... It's extremely misleading, you know... because you can copy, for
>> instance, five mailboxes of 50GB concurrently and flood a gigabit interface
>> while copying (obviously because the disks can keep up with that
>> throughput)... but later, at an hour when yesterday, and even four days
>> before, you had no issues at all, you see the problem I described... even
>> when it is not exactly a peak hour (perhaps two hours after the peak,
>> even)... Although I wasn't aware of all the things you explain in this
>> email....*
>>
>> I have seen advice to not use compression in a high load scenario in some
>> other reply.
>>
>> I tend to disagree: Since you seem to be limited when the SLC cache is
>> exhausted, you should get better performance if you compress your data. I
>> have found that zstd-2 works well for me (giving a significant overall
>> reduction of size at reasonable additional CPU load). Since ZFS allows
>> switching compression algorithms at any time, you can experiment with
>> different algorithms and levels.
>>
>> *I see... you say compression should be enabled.... The main reason we have
>> not enabled it yet is to keep the system as close as possible to the
>> default config... you know... so that later we can ask on these mailing
>> lists if we have an issue... because it is far easier to ask about
>> something strange you are seeing when it happens on top of a well-tested,
>> near-default config....*
>>
>> *But now you say, Stefan... if you switch between compression algorithms
>> you end up with a mix of files compressed in different ways... isn't that a
>> bit of a disaster later? Doesn't it affect performance in some way?*

The compression used is stored in the per-file information; each file in a
dataset could have been written with a different compression method and level.
Blocks are independently compressed - a file level compression may be more
effective. Large mail files will contain incompressible attachments (already
compressed), but in base64 encoding. This should allow a compression ratio of
~1.3. Small files will be plain text or HTML, offering much better compression
factors.

>> One advantage of ZFS compression is that it applies to the ARC, too. And a
>> compression factor of 2 should easily be achieved when storing mail (not
>> for .docx, .pdf, .jpg files though). Having more data in the ARC will
>> reduce the read pressure on the SSDs and will give them more cycles for
>> garbage collections (which are performed in the background and required to
>> always have a sufficient reserve of free flash blocks for writes).
>>
>> *We would use, I assume, lz4... which is the least CPU-"expensive"
>> compression algorithm... and, I assume, also good for avoiding delays when
>> accessing data... do you recommend another one? Do you always recommend
>> compression, then?*

I'd prefer zstd over lz4 since it offers a much higher compression ratio.

Zstd offers higher compression ratios than lz4 at similar or better
decompression speed, but may be somewhat slower compressing the data. But in
my opinion this is outweighed by the higher effective amount of data in the
ARC/L2ARC possible with zstd.

For some benchmarks of different compression algorithms available for ZFS,
compared to uncompressed mode, see the extensive results published by Allan
Jude:

https://docs.google.com/spreadsheets/d/1TvCAIDzFsjuLuea7124q-1UtMd0C9amTgnXm2yPtiUQ/edit?usp=sharing

The SQL benchmarks might best resemble your use case - but remember that a
significant reduction of the amount of data being written to the SSDs might be
more important than the highest transaction rate, since your SSDs put a low
upper limit on that when highly loaded.

>> I'd give it a try - and if it reduces your storage requirements by 10%
>> only, then keep 10% of each SSD unused (not assigned to any partition).
>> That will greatly improve the resilience of your SSDs, reduce the write
>> amplification, allow the SLC cache to stay at its large value, and may make
>> a large difference to the effective performance under high load.
>>
>> *But when you enable compression... only new or modified data gets
>> compressed. Am I wrong?*

Compression is per file system data block (at most 1 MB if you set the
blocksize to that value). Each such block is compressed independently of all
others, to not require more than 1 block to be read and decompressed when
randomly reading a file. If a block does not shrink when compressed (it may
contain already compressed file data) the block is written to disk as-is
(uncompressed).

>> *By the way, we have more or less 1/4 of each disk used (12 TB allocated in
>> the pool according to zpool list, divided between 8 disks of 8 TB...)... do
>> you think we could be suffering from write amplification even with so
>> little disk space used on each disk?*

Your use case will cause a lot of garbage collections and thus particularly
high write amplification values.

>> Regards, STefan
>>
>> *Hey mate, your mail is incredible. It has helped us a lot.
Can we inv= ite >> you a cup of coffee or a beer through Paypal or similar?. Can I help y= ou in >> some manner?.* >> Thanks, I'm glad to help, and I'd appreciate to hear whether you get your= setup optimized for the purpose (and how well it holds up when you approach the= capacity limits of your drives). I'm always interested in experience of users with different use cases tha= n I have (just being a developer with too much archived mail and media collec= ted over a few decades). Regards, STefan --------------5zqkLuBRyXvluXGRQjtNvCUK Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable
On 07.04.22 at 14:30, egoitz@ramattack.net wrote:
On 2022-04-06 23:49, Stefan Esser wrote:

On 2022-04-06 17:43, Stefan Esser wrote:


On 06.04.22 at 16:36, egoitz@ramattack.net wrote:
Hi Rainer!

Thank you so much for your help :) :)

Well, I assume they are in a datacenter and there should not be power outages....

About dataset size... yes... ours are big... they can easily be 3-4 TB each dataset.....

We bought them because, as they are for mailboxes and mailboxes grow and grow, we wanted to have space for hosting them...
Which mailbox format (e.g. mbox, maildir, ...) do you use?
I'm running Cyrus imap so sort of Maildir... too many little files normally..... Sometimes directories with tons of little files....

Assuming that many mails are much smaller than the erase block size of the SSD, this may cause issues. (You may know the following ...)

For example, if you have message sizes of 8 KB and an erase block size of 64 KB (just guessing), then 8 mails will be in an erase block. If half the mails are deleted, then the erase block will still occupy 64 KB, but only hold 32 KB of useful data (and the SSD will only be aware of this fact if TRIM has signaled which data is no longer relevant). The SSD will copy several partially filled erase blocks together in a smaller number of free blocks, which then are fully utilized. Later deletions will repeat this game, and your data will be copied multiple times until it has aged (and the user is less likely to delete further messages). This leads to "write amplification" - data is internally moved around and thus written multiple times.


Stefan!! you are nice!! I think this could explain all our problem. So that's why we see the most randomness in our performance degradation, and it does not necessarily match the highest IO peak hours... And I could cause that performance degradation just by deleting a couple of huge (perhaps 200,000 mails) mail folders during a mid-traffic hour!!

Yes, if deleting large amounts of data triggers performance issues (and the disk does not have a deficient TRIM implementation), then the issue is likely to be due to internal garbage collections colliding with other operations.
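
A minimal sketch of how TRIM behaviour could be checked and tuned on an OpenZFS pool (assuming FreeBSD 13+/OpenZFS 2.x; the pool name mail_dataset is taken from the zpool list output further down, everything else is illustrative):

    # check whether freed blocks are trimmed automatically (off by default)
    zpool get autotrim mail_dataset

    # enable continuous TRIM so deletions reach the SSDs promptly
    zpool set autotrim=on mail_dataset

    # or run a one-off TRIM during a quiet period and watch its progress
    zpool trim mail_dataset
    zpool status -t mail_dataset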

The problem is that, by what I know, the erase block size of an SSD disk is something fixed in the disk firmware. I don't really know if perhaps it could be modified with Samsung Magician or that kind of Samsung tool.... otherwise I don't really see a way of improving it... because apart from that, you are deleting a file in a raidz-2 array... not just in one disk... I assume aligning chunk size with record size and with the "secret" erase size of the SSD could perhaps compensate slightly?

The erase block size is a fixed hardware feature of each flash chip. There is a block size for writes (e.g. 8 KB) and many such blocks are combined in one erase block (of e.g. 64 KB, probably larger in today's SSDs); they can only be returned to the free block pool all together. And if some of these writable blocks hold live data, they must be preserved by collecting them in newly allocated free blocks.

An example of what might happen, showing a simplified layout of files 1, 2, 3 (with writable blocks 1a, 1b, ..., 2a, 2b, ... and "--" for stale data of deleted files, ".." for erased/writable flash blocks) in an SSD might be:

erase block 1: |1a|1b|--|--|2a|--|--|3a|

erase block 2: |--|--|--|2b|--|--|--|1c|

erase block 3: |2c|1d|3b|3c|--|--|--|--|

erase block 4: |..|..|..|..|..|..|..|..|

This is just a random example of how data could be laid out on the physical storage array. It is assumed that the 3 erase blocks were once completely occupied.

In this example, 10 of 32 writable blocks are occupied, and only one free erase block exists.

This situation must not persist, since the SSD needs more empty erase blocks. 10/32 of the capacity is used for data, but 3/4 of the blocks are occupied and not immediately available for new data.

The garbage collection might combine erase blocks 1 and 3 into a currently free one, e.g. erase block 4:

erase block 1: |..|..|..|..|..|..|..|..|

erase block 2: |--|--|--|2b|--|--|--|1c|

erase block 3: |..|..|..|..|..|..|..|..|

erase block 4: |1a|1b|2a|3a|2c|1d|3b|3c|

Now only 2/4 of the capacity is not available for new data (which is still a lot more than 10/32, but better than before).

Now assume file 2 is deleted:

erase block 1: |..|..|..|..|..|..|..|..|

erase block 2: |--|--|--|--|--|--|--|1c|

erase block 3: |..|..|..|..|..|..|..|..|

erase block 4: |1a|1b|--|3a|--|1d|3b|3c|

There is now a new sparsely used erase block 4, and it will soon need to be garbage collected, too - in fact it could be combined with the live data from erase block 2, but this may be delayed until there is demand for more erased blocks (since e.g. file 1 or 3 might also have been deleted by then).

The garbage collection does not know which data blocks belong to which file, and therefore it cannot collect the data belonging to a file into a single erase block. Blocks are allocated as data comes in (as long as enough SLC cells are available in this area, else directly in QLC cells). Your many parallel updates will cause fractions of each larger file to be spread out over many erase blocks.

As you can see, a single file that is deleted may affect many erase blocks, and you have to take redundancy into consideration, which will multiply the effect by a factor of up to 3 for small files (one ZFS allocation block). And remember: deleting a message in mdir format will free the data blocks, but will also remove the directory entry, causing additional meta-data writes (again multiplied by the raid redundancy).

A consumer SSD would normally see only very few parallel writes, and sequential writes of full files will have a high chance to put the data of each file contiguously in the minimum number of erase blocks, allowing to free multiple complete erase blocks when such a file is deleted and thus obviating the need for many garbage collection copies (that occur if data from several independent files is in one erase block).

Actual SSDs have many more cells than advertised. Some 10% to 20% may be kept as a reserve for aging blocks that e.g. may have failed kind of a "read-after-write test" (implemented in the write function, which adds charges to the cells until they return the correct read-outs).
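
A rough way to keep an eye on how much is really being written to the drives over time (a sketch, assuming sysutils/smartmontools is installed; the device name ada0 is only an example):

    # dump the SMART data of one of the SSDs
    smartctl -a /dev/ada0

    # sampling the wear-leveling / total-LBAs-written style attributes from this
    # output once a day gives a rough idea of the physical write volume, and thus
    # of the write amplification relative to what the applications write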

BTW: Having an ashift value that is lower than the internal write block size may also lead to higher write amplification values, but a large ashift may lead to more wasted capacity, which may become an issue if typical file lengths are much smaller than the allocation granularity that results from the ashift value.
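
To see which ashift the existing vdevs were created with, something like this could be used (a sketch; mail_dataset is the pool name from the zpool list output below):

    # the cached pool configuration printed by zdb includes the ashift of each vdev
    zdb -C mail_dataset | grep ashift

    # minimum ashift FreeBSD will use for newly added vdevs
    sysctl vfs.zfs.min_auto_ashift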

Larger mails are less of an issue since they span multiple erase blocks, which will be completely freed when such a message is deleted.

I see I see Stefan...

Samsung has a lot of experience and generally good strategies to deal with such a situation, but SSDs specified for use in storage systems might be much better suited for that kind of usage profile.

Yes... and the disks for our purpose... perhaps weren't QVOs....

You should have got (much more expensive) server grade SSDs, IMHO.

But even 4 * 2 TB QVO (or better, EVO) drives in place of each 8 TB QVO drive would result in better performance (but would need a lot of extra SATA ports).

In fact, I'm not sure whether rotating media and a reasonable L2ARC consisting of a fast M.2 SSD plus a mirror of small SSDs for a LOG device would not be a better match for your use case. Reading the L2ARC would be very fast, writes would be purely sequential and relatively slow, you could choose a suitable L2ARC strategy (caching of file data vs. meta data), and the LOG device would support fast fsync() operations required for reliable mail systems (which confirm data is on stable storage before acknowledging the reception to the sender).
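
A rough sketch of what such a layout could look like in zpool terms; all device names (ada0..ada9, nvd0) are placeholders, not the actual hardware of this setup:

    # rotating disks in raidz2, an NVMe SSD as L2ARC, a mirrored SSD pair as LOG
    zpool create mail_dataset raidz2 ada0 ada1 ada2 ada3 ada4 ada5 ada6 ada7
    zpool add mail_dataset cache nvd0
    zpool add mail_dataset log mirror ada8 ada9

    # choose the L2ARC strategy: cache only metadata, or 'all' for file data too
    zfs set secondarycache=metadata mail_dataset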

We knew they had some speed issues, but we thought those speed issues (as
Samsung explains on the QVO site) only started after exceeding the speed
buffer these disks have. We thought that as long as you didn't exceed its
capacity (the capacity of the speed buffer) no speed problem would arise.
Perhaps we were wrong?

These drives are meant for small loads in a typical PC use case,
i.e. some installations of software in the few GB range, else only
files of a few MB being written, perhaps an import of media files
that range from tens to a few hundred MB at a time, but less often
than once a day.
We move= , you know... lots of little files... and lot's of different concurrent modifications by 1500-2000 concurrent imap connections we have...<= /div>

I do not expect the read load to be a problem (except possibly when the SSD is moving data from SLC to QLC blocks, but even then reads will get priority). But writes and trims might very well overwhelm the SSD, especially when it's getting full. Keeping a part of the SSD unused (excluded from the partitions created) will lead to a large pool of unused blocks. This will reduce the write amplification - there are many free blocks in the "unpartitioned part" of the SSD, and thus there is less urgency to compact partially filled blocks. (E.g. if you include only 3/4 of the SSD capacity in a partition used for the ZPOOL, then 1/4 of each erase block could be free due to deletions/TRIM without any compactions required to hold all this data.)

Keeping a significant percentage of the SSD unallocated is a good strategy to improve its performance and resilience.

Well, we have allocated all the disk space... but not used... just allocated.... you know... we do a zpool create with the whole disks.....

I think the only chance for a solution that does not require new hardware is to make sure, only some 80% of the SSDs are used (i.e. allocate only 80% for ZFS, leave 20% unallocated). This will significantly reduce the rate of garbage collections and thus reduce the load they cause.
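
A sketch of how such an 80% allocation could be done with gpart when (re)creating the pool; the sizes and names are only an example for 8 TB drives, and an existing pool cannot be shrunk this way, it would have to be rebuilt:

    # partition each SSD and leave roughly 20% of it untouched
    # (repeat for the remaining drives with labels mail1, mail2, ...)
    gpart create -s gpt ada0
    gpart add -t freebsd-zfs -l mail0 -s 6400G ada0

    # build the pool from the labeled partitions instead of the whole disks
    zpool create mail_dataset raidz2 gpt/mail0 gpt/mail1 gpt/mail2 gpt/mail3 \
        gpt/mail4 gpt/mail5 gpt/mail6 gpt/mail7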

I'd use a fast compression algorithm (zstd - choose a level that does not overwhelm the CPU; there are benchmark results for ZFS with zstd, and I found zstd-2 to be best for my use case). This will more than make up for the space you left unallocated on the SSDs.
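
For reference, a minimal sketch of enabling that on the existing dataset (name taken from the zpool list output below; zstd requires OpenZFS 2.0 or newer, and only blocks written after the change are compressed):

    # enable zstd level 2 compression for newly written blocks
    zfs set compression=zstd-2 mail_dataset

    # watch the effective setting and the achieved ratio over time
    zfs get compression,compressratio mail_dataset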

A different mail box format might help, too - I'm happy with dovecot's mdbox format, which is as fast but much more efficient than mdir.

As the SSD fills, the space available for the single level write
cache gets smaller
The single level write cache is the cache these SSD drives have for compensating the speed issues they have due to using QLC memory? Do you refer to that? Sorry, I don't understand this paragraph well.

Yes, the SSD is specified to hold e.g. 1 TB at 4 bits per cell. The SLC cache has only 1 bit per cell, thus a 6 GB SLC cache needs as many cells as 24 GB of data in QLC mode.

Ok, true.... yes....

A 100 GB SLC cache would reduce the capacity of a 1 TB SSD to 700 GB (600 GB in 150 tn QLC cells plus 100 GB in 100 tn SLC cells).

Ahh! You mean that the SLC capacity for speeding up the QLC disks is obtained from each single layer of the QLC?

There are no specific SLC cells. A fraction of the QLC capable cells is written with only 1 instead of 4 bits. This is a much simpler process, since there are only 2 charge levels per cell that are used, while QLC uses 16 charge levels, and you can only add charge (must not overshoot), therefore only small increments are added until the correct value can be read out.

But since SLC cells take away specified capacity (which is calculated assuming all cells hold 4 bits each, not only 1 bit), their number is limited and shrinks as demand for QLC cells grows.

The advantage of the SLC cache is fast writes, but also that data in it may have become stale (trimmed) and thus will never be copied over into a QLC block. But as the SSD fills and the size of the SLC cache shrinks, this capability will be mostly lost, and lots of very short lived data is stored in QLC cells, which will quickly become partially stale and thus needing compaction as explained above.

Therefore, the fraction of the cells used as an SLC cache is reduced when it gets full (e.g. ~1 TB in ~250 tn QLC cells, plus 6 GB in 6 tn SLC cells).

Sorry, I don't get this last sentence... I don't understand it because I don't really know the meaning of tn...

but I think I'm getting the idea if you say that each QLC layer has its own SLC cache obtained from the disk space available for each QLC layer....

And with less SLC cells available for short term storage of data the probability of data being copied to QLC cells before the irrelevant messages have been deleted is significantly increased. And that will again lead to many more blocks with "holes" (deleted messages) in them, which then need to be copied possibly multiple times to compact them.

If I'm correct above, I think I got the idea, yes....

(on many SSDs, I have no numbers for this
particular device), and thus the amount of data that can be
written at single cell speed shrinks as the SSD gets full.

I have just looked up the size of the SLC cache, it is specified
to be 78 GB for the empty SSD, 6 GB when it is full (for the 2 TB
version, smaller models will have a smaller SLC cache).
Assuming you were talking about the cache for compensating speed that we previously commented on, I should say these are the 870 QVO but the 8TB version. So they should have the biggest cache for compensating the speed issues...

I have looked up the data: the larger versions of the 870 QVO have the same SLC cache configuration as the 2 TB model, 6 GB minimum and up to 72 GB more if there are enough free blocks.

Ours is the 8TB model, so I assume it could have bigger limits. The disks are mostly empty, really.... so... for instance....

zpool list
NAME            SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
root_dataset    448G  2.29G   446G        -         -     1%     0%  1.00x  ONLINE  -
mail_dataset   58.2T  11.8T  46.4T        -         -    26%    20%  1.00x  ONLINE  -

Ok, seems you have got 10 * 8 TB in a raidz2 configuration.

Only 20% of the mail dataset is in use, the situation will become much worse when the pool will fill up!

I suppose fragmentation affects too....

On magnetic media fragmentation means that a file is spread out over the disk in a non-optimal way, causing access latencies due to seeks and rotational delay. That kind of fragmentation is not really relevant for SSDs, which allow for fast random access to the cells.

And the FRAG value shown by the "zpool list" command is not about fragmentation of files at all, it is about the structure of free space. Anyway less relevant for SSDs than for classic hard disk drives.

But after writing those few GB at a speed of some 500 MB/s (i.e.
after 12 to 150 seconds), the drive will need several minutes to
transfer those writes to the quad-level cells, and will operate
at a fraction of the nominal performance during that time. (QLC writes max out at 80 MB/s for the 1 TB model, 160 MB/s for the
2 TB model.)
Well, we are on the 8TB model. I think I have understood what you wrote in the previous paragraph. You said they can be fast, but not constantly, because later they have to write all that to their permanent storage from the cache. And that's slow. Am I wrong? Even in the 8TB model, you think, Stefan?

The controller in the SSD supports a given number of channels (e.g. 4), each of which can access a Flash chip independently of the others. Small SSDs often have fewer Flash chips than there are channels (and thus a lower throughput, especially for writes), but the larger models often have more chips than channels and thus the performance is capped.

This is totally logical. If a QVO disk performed as well as or better than an Intel without consequences.... who would buy an expensive Intel enterprise drive?

The QVO is bandwidth limited due to the SATA data rate of 6 Gbit/s anyway, and it is optimized for reads (which are not significantly slower than offered by the TLC models). This is a viable concept for a consumer PC, but not for a server.

In the case of the 870 QVO, the controller supports 8 channels, which allows it to write 160 MB/s into the QLC cells. The 1 TB model apparently has only 4 Flash chips and is thus limited to 80 MB/s in that situation, while the larger versions have 8, 16, or 32 chips. But due to the limited number of channels, the write rate is limited to 160 MB/s even for the 8 TB model.
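
One way to see that limit in practice would be to write more data than the SLC cache can absorb and watch when the rate drops (a sketch; the target path is only an example for the pool's mountpoint, and /dev/random data avoids the test being skewed if compression is enabled later):

    # write far past the SLC cache size and watch the reported rate
    dd if=/dev/random of=/mail_dataset/slc_test bs=1M count=100000 status=progress

    # in a second terminal: per-disk throughput, latency and busy percentage
    gstat -p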

Totally logical Stefan...

If you had 4 * 2 TB instead, the throughput would be 4 * 160 MB/s in this limit.

The main problem we are facing is that in some peak moments, when the machine serves connections for all the instances it has, and only as said in some peak moments... like 09am or 11am.... it seems the machine becomes slower... as if the disks weren't able to serve all they have to serve.... In these moments, no big files are moved... but as we have 1800-2000 concurrent IMAP connections... normally each one is doing... little changes in its mailbox. Do you think perhaps these disks are then not appropriate for this kind of usage?

I'd guess that the drives get into a state in which they have to recycle lots of partially free blocks (i.e. perform kind of a garbage collection) and then three kinds of operations are competing with each other:

  1. reads (generally prioritized)
  2. writes (filling the SLC cache up to its maximum size)
  3. compactions of partially filled blocks (required to make free blocks available for re-use)

Writes can only proceed if there are sufficient free blocks, which on a filled SSD with partially filled erase blocks means that operations of type 3. need to be performed with priority to not stall all writes.

My assumption is that this is what you are observing under peak load.
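
That hypothesis could be checked the next time a peak happens by watching the pool and the individual disks (a sketch):

    # per-vdev bandwidth and IOPS, refreshed every 10 seconds
    zpool iostat -v mail_dataset 10

    # per-disk busy percentage and per-operation latency; long write latencies
    # while throughput stays modest would point at internal garbage collection
    gstat -p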

It could be, although the disks are not filled.... the pools are at 20 or 30% of capacity and fragmentation is 20%-30% (as zpool list states).

Yes, and that means that your issues will become much more critical over time when the free space shrinks and garbage collections will be required at an even faster rate, with the SLC cache becoming less and less effective to weed out short lived files as an additional factor that will increase write amplification.
And cheap SSDs often have no RAM cache (not checked, but I'd be
surprised if the QVO had one) and thus cannot keep bookkeeping data
in such a cache, further limiting the performance under load.
This brochure (https://semiconductor.samsung.com/resources/brochure/870_Series_Brochure.pdf and the datasheet https://semiconductor.samsung.com/resources/data-sheet/Samsung_SSD_870_QVO_Data_Sheet_Rev1.1.pdf) says, if I have read it properly, that the 8TB drive has 8GB of RAM? I assume that is what they call the turbo write cache?

No, the turbo write cache consists of the cells used in SLC mode (which can be any cells, not only cells in a specific area of the flash chip).

I see I see....

The RAM is needed for fast lookup of the position of data for reads and of free blocks for writes.

Ours... seem to have 8GB of LPDDR4 RAM.... as the datasheet states....

Yes, and it makes sense that the RAM size is proportional to the capacity since a few bytes are required per addressable data block.

If the block size was 8 KB the RAM could hold 8 bytes (e.g. a pointer and some status flags) for each logically addressable block. But there is no information about the actual internal structure of the QVO that I know of.

[...]

I see.... It's extremely misleading, you know... because you can copy five mailboxes of 50GB concurrently, for instance, and flood a gigabit interface while copying (obviously because the disks can keep up with that throughput)... but later... you find yourself at an hour where yesterday, and even 4 days before, you had no issues at all... and that day you see the commented issue... even without being exactly at a peak hour (perhaps it is even two hours after the peak hour)... but I wasn't aware of all the things you explain in this email....

I have seen advice to not use compression in a high load scenario in some other reply.

I tend to disagree: Since you seem to be limited when the SLC cache is exhausted, you should get better performance if you compress your data. I have found that zstd-2 works well for me (giving a significant overall reduction of size at reasonable additional CPU load). Since ZFS allows switching compression algorithms at any time, you can experiment with different algorithms and levels.

I see... you say compression should be enabled.... The main reason we have not enabled it yet is to keep the system as close as possible to the config defaults... you know... so that later we are able to ask on these mailing lists if we have an issue... because, you know... it is far easier to ask about something strange you are seeing when that strange thing happens close to a well tested config, like the default config....

But now you say, Stefan... if you switch between compression algorithms you will end up with a mix of different files compressed in different manners... isn't that a bit of a disaster later? Doesn't it affect performance in some manner?

The compression used is stored in the per-file information; each file in a dataset could have been written with a different compression method and level. Blocks are independently compressed - a file level compression may be more effective. Large mail files will contain attachments that are already compressed, but stored in base64 encoding. This should allow a compression ratio of ~1.3. Small files will be plain text or HTML, offering much better compression factors.

One advantage of ZFS compression is that it applies to the ARC, too. And a compression factor of 2 should easily be achieved when storing mail (not for .docx, .pdf, .jpg files though). Having more data in the ARC will reduce the read pressure on the SSDs and will give them more cycles for garbage collections (which are performed in the background and required to always have a sufficient reserve of free flash blocks for writes).
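
Once compression is active, its effect on the ARC can be observed directly (a sketch; these are the arcstats counters OpenZFS exposes via sysctl on FreeBSD):

    # compressed vs. logical size of the data currently held in the ARC
    sysctl kstat.zfs.misc.arcstats.compressed_size
    sysctl kstat.zfs.misc.arcstats.uncompressed_size
    sysctl kstat.zfs.misc.arcstats.size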

We would use, I assume, lz4... which is the least "expensive" compression algorithm for the CPU... and, I assume, also for avoiding delay when accessing data... do you recommend another one? Do you always recommend compression then?


I'd prefer zstd over lz4 since it offers a much higher compression ratio.

Zstd offers higher compression ratios than lz4 at similar or better decompression speed, but may be somewhat slower compressing the data. But in my opinion this is outweighed by the higher effective amount of data in the ARC/L2ARC possible with zstd.

For some benchmarks of different compression algorithms available for ZFS and compared to uncompressed mode see the extensive results published by Jude Allan:

https://docs.google.com/spreadsheets/d/1TvCAIDzFsjuLuea7124q-1UtMd0C9amTgnXm2yPtiUQ/edit?usp=sharing


The SQL benchmarks might best resemble your use case - but remember that a significant reduction of the amount of data being written to the SSDs might be more important than the highest transaction rate, since your SSDs put a low upper limit on that when highly loaded.

I'd give it a try - and if it reduces your storage requirements by 10% only, then keep 10% of each SSD unused (not assigned to any partition). That will greatly improve the resilience of your SSDs, reduce the write-amplification, will allow the SLC cache to stay at its large value, and may make a large difference to the effective performance under high load.
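
Whether that 10% is actually reached can be read off the dataset accounting once compression has been active for a while (a sketch, pool name as above):

    # 'logicalused' is the uncompressed amount, 'used' is what actually hit the pool
    zfs get used,logicalused,compressratio mail_dataset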

But when you enable compression... only the new or modified data gets compressed. Am I wrong?

Compression is per file system data block (at most 1 MB if you set the blocksize to that value). Each such block is compressed independently of all others, to not require more than 1 block to be read and decompressed when randomly reading a file. If a block does not shrink when compressed (it may contain compressed file data) the block is written to disk as-is (uncompressed).
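
The block size referred to here is the dataset's recordsize; a sketch of inspecting and raising it (again, only files written after the change use the new value):

    # default is 128K; up to 1M is possible on pools with the large_blocks feature
    zfs get recordsize mail_dataset
    zfs set recordsize=1M mail_dataset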

By the way, we have more or less 1/4 of each disk used (12 TB allocated in the pool as stated by zpool list, divided between 8 disks of 8TB...)... do you think we could be suffering from write amplification and so on... even having so little disk space used on each disk?

Your use case will cause a lot of garbage collections and thus particularly high write amplification values.

Regards, STefan

Hey mate, your mail is incredible. It has helped us a lot. Can we invite you to a cup of coffee or a beer through Paypal or similar? Can I help you in some manner?

Thanks, I'm glad to help, and I'd appreciate to hear whether you get your setup optimized for the purpose (and how well it holds up when you approach the capacity limits of your drives).

I'm always interested in the experience of users with different use cases than I have (just being a developer with too much archived mail and media collected over a few decades).

Regards, STefan

--------------5zqkLuBRyXvluXGRQjtNvCUK-- --------------O0ssWPEiEQcQgZRJEKUGD0ko-- --------------TD2ASvjzfpo4JyxrFz8JyiVP Content-Type: application/pgp-signature; name="OpenPGP_signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="OpenPGP_signature" -----BEGIN PGP SIGNATURE----- wsB5BAABCAAjFiEEo3HqZZwL7MgrcVMTR+u171r99UQFAmJQGRAFAwAAAAAACgkQR+u171r99UTJ 4gf6A4flCtMFuYpldXh1g1ln+Nio4LQGooOn69VSJw4KhihTBqy5ZR8scfhKetf8/miSx/0Akvsc WqZA9bEy67LXbUCekfbuUXQdO8ikXY1H64fecl4ZQZwItnIacKKD6TEIuBDe5sda0N+S2n7mNE/N d3EhNEyQTBOVvOx4vHdQaz+xAR6FFXstc14bs6BaSjROUndk21zO2IE8KMXkxH4RiWqoRuny3Po7 Uz3q0/+rcPSe9GHLw3BOHbvo89NdCFwlgBv5CXDMnEqqeW7ECdZiHn14XL/F30L8zCi2c3eTbD4Q AJeoKbBsmzSaud3opYrv2oCr8UOWiSLEJOkYAyvJeA== =tDUQ -----END PGP SIGNATURE----- --------------TD2ASvjzfpo4JyxrFz8JyiVP-- From nobody Fri Apr 8 17:41:11 2022 X-Original-To: freebsd-performance@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id E838E1A879CD; Fri, 8 Apr 2022 17:41:22 +0000 (UTC) (envelope-from egoitz@ramattack.net) Received: from cu1208c.smtpx.saremail.com (cu1208c.smtpx.saremail.com [195.16.148.183]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 4KZlss2FNpz3j5J; Fri, 8 Apr 2022 17:41:20 +0000 (UTC) (envelope-from egoitz@ramattack.net) Received: from www.saremail.com (unknown [194.30.0.183]) by sieve-smtp-backend02.sarenet.es (Postfix) with ESMTPA id A585B60C13A; Fri, 8 Apr 2022 19:41:11 +0200 (CEST) List-Id: Performance/tuning List-Archive: https://lists.freebsd.org/archives/freebsd-performance List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-performance@freebsd.org X-BeenThere: freebsd-performance@freebsd.org MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="=_fe85fe2db536d584a8585a29da09a2df" Date: Fri, 08 Apr 2022 19:41:11 +0200 From: egoitz@ramattack.net To: Stefan Esser Cc: freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org, freebsd-performance@freebsd.org, Rainer Duffner Subject: Re: Re: Desperate with 870 QVO and ZFS In-Reply-To: References: <4e98275152e23141eae40dbe7ba5571f@ramattack.net> <665236B1-8F61-4B0E-BD9B-7B501B8BD617@ultra-secure.de> <0ef282aee34b441f1991334e2edbcaec@ramattack.net> Message-ID: <3d24c87110b4a155e3f14d53a9309c61@ramattack.net> X-Sender: egoitz@ramattack.net User-Agent: Saremail webmail X-Rspamd-Queue-Id: 4KZlss2FNpz3j5J X-Spamd-Bar: --- Authentication-Results: mx1.freebsd.org; dkim=none; dmarc=pass (policy=reject) header.from=ramattack.net; spf=pass (mx1.freebsd.org: domain of egoitz@ramattack.net designates 195.16.148.183 as permitted sender) smtp.mailfrom=egoitz@ramattack.net X-Spamd-Result: default: False [-3.79 / 15.00]; RCVD_TLS_LAST(0.00)[]; RCVD_VIA_SMTP_AUTH(0.00)[]; XM_UA_NO_VERSION(0.01)[]; TO_DN_SOME(0.00)[]; R_SPF_ALLOW(-0.20)[+ip4:195.16.148.0/24]; NEURAL_HAM_LONG(-1.00)[-1.000]; MIME_GOOD(-0.10)[multipart/alternative,text/plain]; ARC_NA(0.00)[]; RCPT_COUNT_FIVE(0.00)[5]; TO_MATCH_ENVRCPT_SOME(0.00)[]; NEURAL_HAM_SHORT(-1.00)[-1.000]; DMARC_POLICY_ALLOW(-0.50)[ramattack.net,reject]; FROM_NO_DN(0.00)[]; NEURAL_HAM_MEDIUM(-1.00)[-1.000]; MLMMJ_DEST(0.00)[freebsd-fs,freebsd-hackers,freebsd-performance]; FROM_EQ_ENVFROM(0.00)[]; R_DKIM_NA(0.00)[]; MIME_TRACE(0.00)[0:+,1:+,2:~]; ASN(0.00)[asn:3262, 
ipnet:195.16.128.0/19, country:ES]; RCVD_COUNT_TWO(0.00)[2]; MID_RHS_MATCH_FROM(0.00)[] X-ThisMailContainsUnwantedMimeParts: N --=_fe85fe2db536d584a8585a29da09a2df Content-Transfer-Encoding: 8bit Content-Type: text/plain; charset=UTF-8 Hi Stefan, Again extremely grateful. It's an absolute honor to receive your help... really.... I have read this mail now, but I need to read it again more slowly and in a more relaxed way.... When I do that I'll answer you (during the weekend or on Monday at most). Don't worry, I will keep you updated with news :) :) . I promise :) :) Cheers!
Compression is per file system data block (at most 1 MB if you set the blocksize to that value). Each such block is compressed independently of all others, to not require more than 1 block to be read and decompressed when randomly reading a file. If a block does not shrink when compressed (it may contain compressed file data) the block is written to disk as-is (uncompressed). >> BY THE WAY, WE HAVE MORE OR LESS 1/4 OF EACH DISK USED (12 TB ALLOCATED IN A POLL STATED BY ZPOOL LIST, DIVIDED BETWEEN 8 DISKS OF 8TB...)... DO YOU THINK WE COULD BE SUFFERING ON WRITE AMPLIFICATION AND SO... HAVING A SO LITTLE DISK SPACE USED IN EACH DISK?. Your use case will cause a lot of garbage collections and this particular high write amplification values. >> Regards, STefan >> >> HEY MATE, YOUR MAIL IS INCREDIBLE. IT HAS HELPED AS A LOT. CAN WE INVITE YOU A CUP OF COFFEE OR A BEER THROUGH PAYPAL OR SIMILAR?. CAN I HELP YOU IN SOME MANNER?. Thanks, I'm glad to help, and I'd appreciate to hear whether you get your setup optimized for the purpose (and how well it holds up when you approach the capacity limits of your drives). I'm always interested in experience of users with different use cases than I have (just being a developer with too much archived mail and media collected over a few decades). Regards, STefan --=_fe85fe2db536d584a8585a29da09a2df Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset=UTF-8

Hi Stefan,


Again extremely grateful. It's an absolute honor to receive your help... really....


I have read this mail now but I need to read it slower and in a more relaxed way.... When I do that I'll answer you (during the weekend or on Monday at most).


Don't worry I will keep you updated with news :) :) . I promise :) :)


Cheers!

 


On 2022-04-08 13:14, Stefan Esser wrote:



On 07.04.22 at 14:30, egoitz@ramattack.net wrote:
On 2022-04-06 23:49, Stefan Esser wrote:

On 2022-04-06 17:43, Stefan Esser wrote:


On 06.04.22 at 16:36, egoitz@ramattack.net wrote:
Hi Rainer!

Thank you so much for your help :) :)
Well I assume they are in a datacenter and there should not be a power outage....

About dataset size... yes... our ones are big... they can be 3-4 TB easily each dataset.....

We bought them, because as they are for mailboxes and mailboxes grow and grow.... for having space for hosting them...

Which mailbox format (e.g. mbox, maildir, ...) do you use?
 
I'm running Cyrus imap so sort of = Maildir... too many little files normally..... Sometimes directories with t= ons of little files....

Assuming that many mails are much smaller than the erase block size of the SSD, this may cause issues. (You may know the following ...)

For example, if you have message sizes of 8 KB and an erase block size of 64 KB (just guessing), then 8 mails will be in an erase block. If half the mails are deleted, then the erase block will still occupy 64 KB, but only hold 32 KB of useful data (and the SSD will only be aware of this fact if TRIM has signaled which data is no longer relevant). The SSD will copy several partially filled erase blocks together into a smaller number of free blocks, which then are fully utilized. Later deletions will repeat this game, and your data will be copied multiple times until it has aged (and the user is less likely to delete further messages). This leads to "write amplification" - data is internally moved around and thus written multiple times.


Stefan!! you are nice!! I think this could explain all our problem. So that's why we are having the most randomness in our performance degradation, and it does not necessarily have to match the highest io peak hours... I could cause that performance degradation just by deleting a couple of huge mail folders (perhaps 200,000 mails) in a mid-traffic hour!!

Yes, if deleting large amounts of data triggers performance issues (and the disk does not have a deficient TRIM implementation), then the issue is likely to be due to internal garbage collections colliding with other operations.
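
As a minimal sketch of how to verify that TRIM is actually being used (assuming OpenZFS 2.x as in FreeBSD 13; the pool name mail_dataset is taken from later in this thread and may differ):

# does the pool trim freed blocks continuously?
zpool get autotrim mail_dataset

# run a manual TRIM pass and watch its progress
zpool trim mail_dataset
zpool status -t mail_dataset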

The problem is that, from what I know, the erase block size of an SSD disk is something fixed in the disk firmware. I don't really know if perhaps it could be modified with Samsung Magician or that kind of Samsung tool.... otherwise I don't really see a way of improving it... because apart from that, you are deleting a file in a raidz-2 array... not just on one disk... I assume aligning chunk size with recordsize and with the "secret" erase size of the ssd could perhaps compensate slightly?.

The erase block size is a fixed hardware feature of each flash chip. There is a block size for writes (e.g. 8 KB) and many such blocks are combined in one erase block (of e.g. 64 KB, probably larger in today's SSDs); they can only be returned to the free block pool all together. And if some of these writable blocks hold live data, they must be preserved by collecting them in newly allocated free blocks.

An example of what might happen, showing a simplified layout of files 1, 2, 3 (with writable blocks 1a, 1b, ..., 2a, 2b, ... and "--" for stale data of deleted files, ".." for erased/writable flash blocks) in an SSD might be:

erase block 1: |1a|1b|--|--|2a|--|--|3a|

erase block 2: |--|--|--|2b|--|--|--|1c|

erase block 3: |2c|1d|3b|3c|--|--|--|--|

erase block 4: |..|..|..|..|..|..|..|..|

This is just a random example of how data could be laid out on the physical storage array. It is assumed that the 3 erase blocks once were completely occupied.

In this example, 10 of 32 writable blocks are occupied, and only one free erase block exists.

This situation must not persist, since the SSD needs more empty erase blocks. 10/32 of the capacity is used for data, but 3/4 of the blocks are occupied and not immediately available for new data.

The garbage collection might combine erase blocks 1 and 3 into a currently free one, e.g. erase block 4:

erase block 1: |..|..|..|..|..|..|..|..|

erase block 2: |--|--|--|2b|--|--|--|1c|

erase block 3: |..|..|..|..|..|..|..|..|

erase block 4: |1a|1b|2a|3a|2c|1d|3b|3c|

Now only 2/4 of the capacity is not available for new data (which is still a lot more than 10/32, but better than before).

Now assume file 2 is deleted:

erase block 1: |..|..|..|..|..|..|..|..|

erase block 2: |--|--|--|--|--|--|--|1c|

erase block 3: |..|..|..|..|..|..|..|..|

erase block 4: |1a|1b|--|3a|--|1d|3b|3c|

There is now a new sparsely used erase block 4, and it will soon need to be garbage collected, too - in fact it could be combined with the live data from erase block 2, but this may be delayed until there is demand for more erased blocks (since e.g. file 1 or 3 might also have been deleted by then).

The garbage collection does not know which data blocks belong to which file, and therefore it cannot collect the data belonging to a file into a single erase block. Blocks are allocated as data comes in (as long as enough SLC cells are available in this area, else directly in QLC cells). Your many parallel updates will cause fractions of each larger file to be spread out over many erase blocks.

As you can see, a single file that is deleted may affect many erase blocks, and you have to take redundancy into consideration, which will multiply the effect by a factor of up to 3 for small files (one ZFS allocation block). And remember: deleting a message in mdir format will free the data blocks, but will also remove the directory entry, causing additional meta-data writes (again multiplied by the raid redundancy).


A consumer SSD would normally see only very few parallel writes, and sequential writes of full files will have a high chance to put the data of each file contiguously in the minimum number of erase blocks, allowing multiple complete erase blocks to be freed when such a file is deleted and thus obviating the need for many garbage collection copies (that occur if data from several independent files is in one erase block).

Actual SSDs have many more cells than advertised. Some 10% to 20% may be kept as a reserve for aging blocks that e.g. may have failed kind of a "read-after-write test" (implemented in the write function, which adds charges to the cells until they return the correct read-outs).

BTW: Having an ashift value that is lower than the internal write block size may also lead to higher write amplification values, but a large ashift may lead to more wasted capacity, which may become an issue if typical file lengths are much smaller than the allocation granularity that results from the ashift value.
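
A quick sketch of how the ashift in use and the minimum ashift for new vdevs can be inspected on FreeBSD (pool name is hypothetical; zdb needs the pool to be listed in the default cache file):

# ashift actually chosen for each vdev of the pool
zdb -C mail_dataset | grep ashift

# minimum ashift applied to newly created vdevs (2^12 = 4 KB blocks)
sysctl vfs.zfs.min_auto_ashift
sysctl vfs.zfs.min_auto_ashift=12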


Larger mails are less of an issue since they span multiple erase blocks, which will be completely freed when such a message is deleted.

I see I see Stefan...

Samsung has a lot of experience and generally good strategies to deal with such a situation, but SSDs specified for use in storage systems might be much better suited for that kind of usage profile.

Yes... and the disks for our purpose... perhaps weren't QVOs....

You should have got (much more expensive) server grade SSDs, IMHO.

But even 4 * 2 TB QVO (or better EVO) drives per each 8 TB QVO drive would result in better performance (but would need a lot of extra SATA ports).

In fact, I'm not sure whether rotating media and a reasonable L2ARC consisting of a fast M.2 SSD plus a mirror of small SSDs for a LOG device would not be a better match for your use case. Reading the L2ARC would be very fast, writes would be purely sequential and relatively slow, you could choose a suitable L2ARC strategy (caching of file data vs. meta data), and the LOG device would support fast fsync() operations required for reliable mail systems (which confirm data is on stable storage before acknowledging the reception to the sender).
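
As a rough sketch of such a layout (the device names nvd0, ada1p4 and ada2p4 are made up, and the pool name is only an example; partitioning and sizing are left out):

# add a fast NVMe device as L2ARC and a small mirrored SLOG
zpool add mail_dataset cache nvd0
zpool add mail_dataset log mirror ada1p4 ada2p4

# optionally restrict the L2ARC to metadata only
zfs set secondarycache=metadata mail_dataset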

We knew they had some speed issues, but those speed issues, we thought (as Samsung explains on the QVO site), started after exceeding the speed buffer these disks have. We thought that as long as you didn't exceed its capacity (the capacity of the speed buffer) no speed problem arises. Perhaps we were wrong?.

These drives are meant for small loads in a typical PC use case, i.e. some installations of software in the few GB range, else only files of a few MB being written, perhaps an import of media files that range from tens to a few hundred MB at a time, but less often than once a day.

We move, you know... lots of little files... and lots of different concurrent modifications by the 1500-2000 concurrent imap connections we have...

I do not expect the read load to be a problem (except possibly when the SSD is moving data from SLC to QLC blocks, but even then reads will get priority). But writes and trims might very well overwhelm the SSD, especially when it's getting full. Keeping a part of the SSD unused (excluded from the partitions created) will lead to a large pool of unused blocks. This will reduce the write amplification - there are many free blocks in the "unpartitioned part" of the SSD, and thus there is less urgency to compact partially filled blocks. (E.g. if you include only 3/4 of the SSD capacity in a partition used for the ZPOOL, then 1/4 of each erase block could be free due to deletions/TRIM without any compactions required to hold all this data.)

Keeping a significant percentage of the SSD unallocated is a good strategy to improve its performance and resilience.

Well, we have allocated all the disk space... but not used... just allocated.... you know... we do a zpool create with the whole disks.....

I think the only chance for a solution that does not require new hardware is to make sure only some 80% of the SSDs are used (i.e. allocate only 80% for ZFS, leave 20% unallocated). This will significantly reduce the rate of garbage collections and thus reduce the load they cause.
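
A minimal sketch of what "leave 20% unallocated" could look like when preparing one of the 8 TB drives (the device name da0 and the partition size are hypothetical and only approximate):

# GPT scheme with a ZFS partition covering only ~80% of the disk
gpart create -s gpt da0
gpart add -t freebsd-zfs -a 1m -s 5960g -l mail0 da0
# the remaining ~20% is never written and stays available to the
# SSD firmware as additional spare area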

I'd use a fast compression algorithm (zstd - choose a level that does not overwhelm the CPU; there are benchmark results for ZFS with zstd, and I found zstd-2 to be best for my use case). This will more than make up for the space you left unallocated on the SSDs.

A different mail box format might help, too - I'm happy with dovecot's mdbox format, which is as fast but much more efficient than mdir.
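
For reference, switching a Dovecot installation to mdbox is mostly a mail_location change (the path below is only an example, and existing mailboxes would still have to be converted, e.g. with dsync):

# /usr/local/etc/dovecot/conf.d/10-mail.conf
mail_location = mdbox:~/mdbox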

As the SSD fills, the space available for the single level write cache gets smaller

The single level write cache is the cache these ssd drives have, for compensating the speed issues they have due to using qlc memory?. Do you refer to that?. Sorry I don't understand well this paragraph.

Yes, the SSD is specified to hold e.g. 1 TB at 4 bits per cell. The SLC cache has only 1 bit per cell, thus a 6 GB SLC cache needs as many cells as 24 GB of data in QLC mode.

Ok, true.... yes....

A 100 GB SLC cache would reduce the capacity of a 1 TB SSD to 700 GB (600 GB in 150 tn QLC cells plus 100 GB in 100 tn SLC cells).

Ahh! you mean that the SLC capacity for speeding up the QLC disks is obtained from each single layer of the QLC?.

There are no specific SLC cells. A fraction of the QLC capable cells is written with only 1 instead of 4 bits. This is a much simpler process, since there are only 2 charge levels per cell that are used, while QLC uses 16 charge levels, and you can only add charge (must not overshoot), therefore only small increments are added until the correct value can be read out.

But since SLC cells take away specified capacity (which is calculated assuming all cells hold 4 bits each, not only 1 bit), their number is limited and shrinks as demand for QLC cells grows.

The advantage of the SLC cache is fast writes, but also that data in it may have become stale (trimmed) and thus will never be copied over into a QLC block. But as the SSD fills and the size of the SLC cache shrinks, this capability will be mostly lost, and lots of very short lived data is stored in QLC cells, which will quickly become partially stale and thus need compaction as explained above.

Therefore, the fraction of the cells used as an SLC cache is reduced when it gets full (e.g. ~1 TB in ~250 tn QLC cells, plus 6 GB in 6 tn SLC cells).

Sorry I don't get this last sentence... don't understand it because I don't really know the meaning of tn...

but I think I'm getting the idea if you say that each QLC layer has its own SLC cache obtained from the disk space available for each QLC layer....

And with fewer SLC cells available for short term storage of data the probability of data being copied to QLC cells before the irrelevant messages have been deleted is significantly increased. And that will again lead to many more blocks with "holes" (deleted messages) in them, which then need to be copied possibly multiple times to compact them.

If I'm correct above, I think I got the idea yes....

(on many SSDs, I have no numbers for this particular device), and thus the amount of data that can be written at single cell speed shrinks as the SSD gets full.

I have just looked up the size of the SLC cache, it is specified to be 78 GB for the empty SSD, 6 GB when it is full (for the 2 TB version, smaller models will have a smaller SLC cache).

Assuming you were talking about the cache for compensating speed we previously commented on, I should say these are the 870 QVO but the 8TB version. So they should have the biggest cache for compensating the speed issues...

I have looked up the data: the larger versions of the 870 QVO have the same SLC cache configuration as the 2 TB model, 6 GB minimum and up to 72 GB more if there are enough free blocks.

Ours is the 8TB model so I assume it could have bigger limits. The disks are mostly empty, really.... so... for instance....

zpool list
NAME           SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
root_dataset   448G  2.29G   446G        -         -     1%     0%  1.00x  ONLINE  -
mail_dataset  58.2T  11.8T  46.4T        -         -    26%    20%  1.00x  ONLINE  -

Ok, seems you have got 10 * 8 TB in a raidz2 configuration.

Only 20% of the mail dataset is in use; the situation will become much worse when the pool fills up!

I suppose fragmentation affects too....

On magnetic media fragmentation means that a file is spread out over the disk in a non-optimal way, causing access latencies due to seeks and rotational delay. That kind of fragmentation is not really relevant for SSDs, which allow for fast random access to the cells.

And the FRAG value shown by the "zpool list" command is not about fragmentation of files at all, it is about the structure of free space. Anyway less relevant for SSDs than for classic hard disk drives.

But after writing those few GB at a speed of some 500 MB/s (i.e. after 12 to 150 seconds), the drive will need several minutes to transfer those writes to the quad-level cells, and will operate at a fraction of the nominal performance during that time. (QLC writes max out at 80 MB/s for the 1 TB model, 160 MB/s for the 2 TB model.)

Well, we are on the 8TB model. I think I have understood what you wrote in the previous paragraph. You said they can be fast but not constantly, because later they have to write all that from the cache to their permanent storage. And that's slow. Am I wrong?. Even in the 8TB model, you think, Stefan?.

The controller in the SSD supports a given number of channels (e.g. 4), each of which can access a Flash chip independently of the others. Small SSDs often have fewer Flash chips than there are channels (and thus a lower throughput, especially for writes), but the larger models often have more chips than channels and thus the performance is capped.

This is totally logical. If a QVO disk performed as well as or better than an Intel without consequences.... who was going to buy an expensive Intel enterprise drive?.

The QVO is bandwidth limited due to the SATA data rate of 6 Gbit/s anyway, and it is optimized for reads (which are not significantly slower than offered by the TLC models). This is a viable concept for a consumer PC, but not for a server.

In the case of the 870 QVO, the controller supports 8 channels, which allows it to write 160 MB/s into the QLC cells. The 1 TB model apparently has only 4 Flash chips and is thus limited to 80 MB/s in that situation, while the larger versions have 8, 16, or 32 chips. But due to the limited number of channels, the write rate is limited to 160 MB/s even for the 8 TB model.

Totally logical Stefan...

If you had 4 * 2 TB instead, the throughput would be 4 * 160 MB/s in this limit.

The main problem we are facing is that in some peak moments, when the machine serves connections for all the instances it has, and only as said in some peak moments... like 09am or 11am.... it seems the machine becomes slower... as if the disks weren't able to serve all they have to serve.... In these moments, no big files are moved... but as we have 1800-2000 concurrent imap connections... each one is normally doing... little changes in their mailbox. Do you think perhaps these disks are then not appropriate for this kind of usage?.

I'd guess that the drives get into a state in which they have to recycle lots of partially free blocks (i.e. perform kind of a garbage collection) and then three kinds of operations are competing with each other:

  1. reads (generally prioritized)
  2. writes (filling the SLC cache up to its maximum size)
  3. compactions of partially filled blocks (required to make free blocks available for re-use)

Writes can only proceed if there are sufficient free blocks, which on a filled SSD with partially filled erase blocks means that operations of type 3 need to be performed with priority to not stall all writes.

My assumption is that this is what you are observing under peak load.

It could be, although the disks are not filled.... the pool is at 20 or 30% of capacity and fragmentation from 20%-30% (as zpool list states).

Yes, and that means that your issues will become much more critical over time when the free space shrinks and garbage collections will be required at an even faster rate, with the SLC cache becoming less and less effective to weed out short lived files as an additional factor that will increase write amplification.

And cheap SSDs often have no RAM cache (not checked, but I'd be surprised if the QVO had one) and thus cannot keep bookkeeping data in such a cache, further limiting the performance under load.

This brochure (https://semiconductor.samsung.com/resources/brochure/870_Series_Brochure.pdf and the datasheet https://semiconductor.samsung.com/resources/data-sheet/Samsung_SSD_870_QVO_Data_Sheet_Rev1.1.pdf) says, if I have read properly, that the 8TB drive has 8GB of ram?. I assume that is what they call the turbo write cache?.

No, the turbo write cache consists of the cells used in SLC mode (which can be any cells, not only cells in a specific area of the flash chip).

I see I see....

The RAM is needed for fast lookup of the position of data for reads and of free blocks for writes.

Our ones... seem to have 8GB LPDDR4 of ram.... as the datasheet states....

Yes, and it makes sense that the RAM size is proportional to the capacity since a few bytes are required per addressable data block.

If the block size was 8 KB the RAM could hold 8 bytes (e.g. a pointer and some status flags) for each logically addressable block. But there is no information about the actual internal structure of the QVO that I know of.

[...]

I see.... It's extremely misleading you know... because... you can copy five mailboxes of 50GB concurrently for instance.... and you flood a gigabit interface copying (obviously because the disks can keep that throughput)... but later.... you see... you are in an hour that yesterday, and even 4 days before, gave you no issues... and that day... you see the commented issue... even not being exactly at a peak hour (perhaps it is even two hours after the peak hour)... or... but I wasn't aware of all the things you say in this email....

I have seen advice to not use compression in a high load scenario in some other reply.

I tend to disagree: Since you seem to be limited when the SLC cache is exhausted, you should get better performance if you compress your data. I have found that zstd-2 works well for me (giving a significant overall reduction of size at reasonable additional CPU load). Since ZFS allows switching compression algorithms at any time, you can experiment with different algorithms and levels.

I see... you say compression should be enabled.... The main reason we have not enabled it yet is to keep the system as near as possible to the config defaults... you know... so that later we can ask on these mailing lists if we have an issue... because you know... it is far easier to ask about something strange you are seeing when that strange thing is near a well tested config, like the default config....

But now you say Stefan... if you switch between compression algorithms you will end up with a mix of different files compressed in different manners... isn't that a bit of a disaster later?. Doesn't it affect performance in some manner?.

The compression used is stored in the per-file information; each file in a dataset could have been written with a different compression method and level. Blocks are independently compressed - a file level compression may be more effective. Large mail files will contain incompressible attachments (already compressed), but in base64 encoding. This should allow a compression ratio of ~1.3. Small files will be plain text or HTML, offering much better compression factors.

One advantage of ZFS compression is that it applies to the ARC, too. And a compression factor of 2 should easily be achieved when storing mail (not for .docx, .pdf, .jpg files though). Having more data in the ARC will reduce the read pressure on the SSDs and will give them more cycles for garbage collections (which are performed in the background and required to always have a sufficient reserve of free flash blocks for writes).
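
On FreeBSD the effect on the ARC can be estimated from the arcstats kstats, assuming the running OpenZFS version exposes these counters:

# logical vs. physical size of data held in the ARC
sysctl kstat.zfs.misc.arcstats.uncompressed_size
sysctl kstat.zfs.misc.arcstats.compressed_size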

We would use, I assume, lz4... which is the least "expensive" compression algorithm for the CPU... and, I assume too, for avoiding delays accessing data... do you recommend another one?. Do you always recommend compression then?.

I'd prefer zstd over lz4 since it offers a much higher compression ratio.

Zstd offers higher compression ratios than lz4 at similar or better decompression speed, but may be somewhat slower compressing the data. But in my opinion this is outweighed by the higher effective amount of data in the ARC/L2ARC possible with zstd.
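
A minimal sketch of trying this on a dataset (the dataset name is hypothetical; only data written after the change is affected):

# enable zstd level 2 and later check the achieved ratio
zfs set compression=zstd-2 mail_dataset/mailboxes
zfs get compression,compressratio mail_dataset/mailboxes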

For some benchmarks of different compression algorithms available for ZFS and compared to uncompressed mode see the extensive results published by Allan Jude:

https://docs.google.com/spreadsheets/d/1TvCAIDzFsjuLuea7124q-1UtMd0C9amTgnXm2yPtiUQ/edit?usp=sharing

The SQL benchmarks might best resemble your use case - but remember that a significant reduction of the amount of data being written to the SSDs might be more important than the highest transaction rate, since your SSDs put a low upper limit on that when highly loaded.

I'd give it a try - and if it reduces your storage requirements by only 10%, then keep 10% of each SSD unused (not assigned to any partition). That will greatly improve the resilience of your SSDs, reduce the write amplification, allow the SLC cache to stay at its large value, and may make a large difference to the effective performance under high load.

But when you enable compression... only the new data that is modified or written gets compressed. Am I wrong?.

Compression is per file system data block (at most 1 MB if you set the blocksize to that value). Each such block is compressed independently of all others, to not require more than 1 block to be read and decompressed when randomly reading a file. If a block does not shrink when compressed (it may contain compressed file data) the block is written to disk as-is (uncompressed).
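
If larger blocks are wanted for this, a sketch could look like the following (the record size and dataset name are only examples; existing files keep the block size they were written with until rewritten):

# allow up to 1 MB blocks for newly written files on this dataset
zfs set recordsize=1M mail_dataset/mailboxes
zfs get recordsize mail_dataset/mailboxes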


By the way, we have more or less 1/4 of each disk used (12 TB allocated in the pool as stated by zpool list, divided between 8 disks of 8TB...)... do you think we could be suffering from write amplification even with so little disk space used on each disk?.

Your use case will cause a lot of garbage collections and thus particularly high write amplification values.

Regards, STefan

Hey mate, your mail is incredible. It has helped us a lot. Can we invite you to a cup of coffee or a beer through PayPal or similar?. Can I help you in some manner?.

Thanks, I'm glad to help, and I'd appreciate hearing whether you get your setup optimized for the purpose (and how well it holds up when you approach the capacity limits of your drives).

I'm always interested in the experience of users with different use cases than I have (just being a developer with too much archived mail and media collected over a few decades).

Regards, STefan

--=_fe85fe2db536d584a8585a29da09a2df-- From nobody Sat Apr 9 11:47:49 2022 X-Original-To: freebsd-performance@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id CD2A81A8358E; Sat, 9 Apr 2022 11:47:57 +0000 (UTC) (envelope-from jamie@catflap.org) Received: from donotpassgo.dyslexicfish.net (donotpassgo.dyslexicfish.net [IPv6:2001:19f0:300:2185:123::1]) by mx1.freebsd.org (Postfix) with ESMTP id 4KbCzd01Dgz3nH4; Sat, 9 Apr 2022 11:47:56 +0000 (UTC) (envelope-from jamie@catflap.org) X-Catflap-Envelope-From: X-Catflap-Envelope-To: freebsd-fs@FreeBSD.org Received: from donotpassgo.dyslexicfish.net (donotpassgo.dyslexicfish.net [104.207.135.49]) by donotpassgo.dyslexicfish.net (8.14.5/8.14.5) with ESMTP id 239BloSI006666; Sat, 9 Apr 2022 12:47:50 +0100 (BST) (envelope-from jamie@donotpassgo.dyslexicfish.net) Received: (from jamie@localhost) by donotpassgo.dyslexicfish.net (8.14.5/8.14.5/Submit) id 239BlncJ006665; Sat, 9 Apr 2022 12:47:49 +0100 (BST) (envelope-from jamie) From: Jamie Landeg-Jones Message-Id: <202204091147.239BlncJ006665@donotpassgo.dyslexicfish.net> Date: Sat, 09 Apr 2022 12:47:49 +0100 Organization: Dyslexic Fish To: grarpamp@gmail.com, freebsd-hackers@FreeBSD.org Cc: freebsd-questions@FreeBSD.org, freebsd-performance@FreeBSD.org, freebsd-fs@FreeBSD.org Subject: Re: List Mail Formatting Netiquette [ie: 870 QVO] References: In-Reply-To: User-Agent: Heirloom mailx 12.4 7/29/08 List-Id: Performance/tuning List-Archive: https://lists.freebsd.org/archives/freebsd-performance List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-performance@freebsd.org X-BeenThere: freebsd-performance@freebsd.org MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.2.7 (donotpassgo.dyslexicfish.net [104.207.135.49]); Sat, 09 Apr 2022 12:47:50 +0100 (BST) X-Rspamd-Queue-Id: 4KbCzd01Dgz3nH4 X-Spamd-Bar: --- Authentication-Results: mx1.freebsd.org; dkim=none; dmarc=pass (policy=none) header.from=catflap.org; spf=pass (mx1.freebsd.org: domain of jamie@catflap.org designates 2001:19f0:300:2185:123::1 as permitted sender) smtp.mailfrom=jamie@catflap.org X-Spamd-Result: default: False [-3.53 / 15.00]; ARC_NA(0.00)[]; NEURAL_HAM_MEDIUM(-1.00)[-0.999]; FREEFALL_USER(0.00)[jamie]; FROM_HAS_DN(0.00)[]; R_SPF_ALLOW(-0.20)[+mx:dyslexicfish.net]; NEURAL_HAM_LONG(-1.00)[-1.000]; MIME_GOOD(-0.10)[text/plain]; TO_DN_NONE(0.00)[]; RCPT_COUNT_FIVE(0.00)[5]; HAS_ORG_HEADER(0.00)[]; TO_MATCH_ENVRCPT_SOME(0.00)[]; NEURAL_HAM_SHORT(-0.83)[-0.834]; DMARC_POLICY_ALLOW(-0.50)[catflap.org,none]; MLMMJ_DEST(0.00)[freebsd-fs,freebsd-hackers,freebsd-performance,freebsd-questions]; FREEMAIL_TO(0.00)[gmail.com,FreeBSD.org]; RCVD_NO_TLS_LAST(0.10)[]; FROM_EQ_ENVFROM(0.00)[]; R_DKIM_NA(0.00)[]; MIME_TRACE(0.00)[0:+]; ASN(0.00)[asn:20473, ipnet:2001:19f0::/38, country:US]; RCVD_COUNT_TWO(0.00)[2] X-ThisMailContainsUnwantedMimeParts: N grarpamp wrote: > Use the clean habit mailiquette rules to make reading > the world better for everyone :) Whilst I agree entirely, the issue has been raised before, and nothing happens. The sort of posting styles that used to get people kicked off lists are ignored. No-one cares any more. P.S. I gave up trying to decypher the "replies in blue" messages - the whole thing is in white-on-black on this terminal! P.P.S. 
We're probably both guilty of crossposting, but as I said, no-one cares! P.P.P.S. Maybe I should just have replied, quoting your whole message with just "me too" added to the top :-) Cheers, Jamie From nobody Mon May 2 13:36:23 2022 X-Original-To: freebsd-performance@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 57E221AB0845; Mon, 2 May 2022 13:42:10 +0000 (UTC) (envelope-from egoitz@ramattack.net) Received: from cu1208c.smtpx.saremail.com (cu1208c.smtpx.saremail.com [195.16.148.183]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 4KsPQm5j8cz3pJN; Mon, 2 May 2022 13:42:08 +0000 (UTC) (envelope-from egoitz@ramattack.net) Received: from www.saremail.com (unknown [194.30.0.183]) by sieve-smtp-backend02.sarenet.es (Postfix) with ESMTPA id 1365560C0D5; Mon, 2 May 2022 15:36:23 +0200 (CEST) List-Id: Performance/tuning List-Archive: https://lists.freebsd.org/archives/freebsd-performance List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-performance@freebsd.org X-BeenThere: freebsd-performance@freebsd.org MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="=_672a997159081f6af8e5fa01d8fde077" Date: Mon, 02 May 2022 15:36:23 +0200 From: egoitz@ramattack.net To: egoitz@ramattack.net Cc: Stefan Esser , freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org, freebsd-performance@freebsd.org, Rainer Duffner , owner-freebsd-hackers@freebsd.org Subject: Re: Desperate with 870 QVO and ZFS In-Reply-To: References: <4e98275152e23141eae40dbe7ba5571f@ramattack.net> <665236B1-8F61-4B0E-BD9B-7B501B8BD617@ultra-secure.de> <0ef282aee34b441f1991334e2edbcaec@ramattack.net> <3d24c87110b4a155e3f14d53a9309c61@ramattack.net> Message-ID: <32cb580b30636082108a070ee009fdb9@ramattack.net> X-Sender: egoitz@ramattack.net User-Agent: Saremail webmail X-Rspamd-Queue-Id: 4KsPQm5j8cz3pJN X-Spamd-Bar: --- Authentication-Results: mx1.freebsd.org; dkim=none; dmarc=pass (policy=reject) header.from=ramattack.net; spf=pass (mx1.freebsd.org: domain of egoitz@ramattack.net designates 195.16.148.183 as permitted sender) smtp.mailfrom=egoitz@ramattack.net X-Spamd-Result: default: False [-3.79 / 15.00]; RCVD_TLS_LAST(0.00)[]; RCVD_VIA_SMTP_AUTH(0.00)[]; XM_UA_NO_VERSION(0.01)[]; TO_DN_SOME(0.00)[]; R_SPF_ALLOW(-0.20)[+ip4:195.16.148.0/24]; NEURAL_HAM_LONG(-1.00)[-1.000]; MIME_GOOD(-0.10)[multipart/alternative,text/plain]; ARC_NA(0.00)[]; TO_MATCH_ENVRCPT_SOME(0.00)[]; NEURAL_HAM_SHORT(-1.00)[-1.000]; DMARC_POLICY_ALLOW(-0.50)[ramattack.net,reject]; RCPT_COUNT_SEVEN(0.00)[7]; FROM_NO_DN(0.00)[]; NEURAL_HAM_MEDIUM(-1.00)[-1.000]; MLMMJ_DEST(0.00)[freebsd-fs,freebsd-hackers,freebsd-performance]; FROM_EQ_ENVFROM(0.00)[]; R_DKIM_NA(0.00)[]; MIME_TRACE(0.00)[0:+,1:+,2:~]; ASN(0.00)[asn:3262, ipnet:195.16.128.0/19, country:ES]; RCVD_COUNT_TWO(0.00)[2]; MID_RHS_MATCH_FROM(0.00)[] X-ThisMailContainsUnwantedMimeParts: N --=_672a997159081f6af8e5fa01d8fde077 Content-Transfer-Encoding: 8bit Content-Type: text/plain; charset=UTF-8 Hi Matthias, I apologize if these emails were annoying for reading or similar... due to the html, top posting etc... but... in a so large mail thread, how could one properly differ one answer from the other one.. how could you highlight some sentences for keeping attention between relevant parts and so... 
it's difficult.... Anyway, very very sorry for having caused some noise... Best regards, On 2022-04-08 20:00, Matthias Apitz wrote: > Hello egoitz@ramattack.net, > > Please be so kind and stop sending mails in HTML and stop top posting. > Thanks in advance > > matthias

--=_672a997159081f6af8e5fa01d8fde077-- From nobody Wed Jul 6 08:29:48 2022 X-Original-To: freebsd-performance@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id B4B681D01DEE for ; Wed, 6 Jul 2022 08:30:01 +0000 (UTC) (envelope-from tdtemccna@gmail.com) Received: from mail-lj1-x22a.google.com (mail-lj1-x22a.google.com [IPv6:2a00:1450:4864:20::22a]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256 client-signature RSA-PSS (2048 bits) client-digest SHA256) (Client CN "smtp.gmail.com", Issuer "GTS CA 1D4" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4LdCQd1FVzz4mt8 for ; Wed, 6 Jul 2022 08:30:01 +0000 (UTC) (envelope-from tdtemccna@gmail.com) Received: by mail-lj1-x22a.google.com with SMTP id n15so17490920ljg.8 for ; Wed, 06 Jul 2022 01:30:01 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=mime-version:from:date:message-id:subject:to:cc; bh=qNJS8bXJjIJEOjJk4e4xeTOXLNTgbklFiRogoFnIe/0=; b=SYq4u1JrhKwx4hQJhU0xET3WKM30m18FQq+YM9FVtN9ZzEixibtX4jMpYllaFZEFFE CO8x8TleRzq/DnxzNvl2BADHi3xh8AfqReuOUaBuELPsscLV98zVwld0NEeP0xT2aZCf vaUj+IqV/SqRj2PU2Z6+6ye+Id8CaE3KHb9FAzRm8xTBukW/iPpazjPHS3Hc1e3knSZP 5CRHLLSAqWeIyNqJeagD7YovPox5dCc/yb91kqEbHFxNxk1wVt0wKOFtymr8rGQARMKG G36up63tmJ1K/vGz8U4/LdyzbN0tSUl/phR17AqmXfDlukxkQq0SYOJ3En0hE/epi/ZL dEtw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:from:date:message-id:subject:to:cc; bh=qNJS8bXJjIJEOjJk4e4xeTOXLNTgbklFiRogoFnIe/0=; b=mlV0nK/B3Scbjbk5RMghXQsLuiI7xwO6/U8uFyB48Y24QxqPg3MELhejnfR/zlUFlN Ceqv9DsVskb3HG911k44szB8xXiISOL9yTgM9g/UwazIrutSlllgccO/chYDerzAyeAy xhTU0If+1Zb9wZC5wNCmDu2/oqxEM08CzAv0MEe/g8Khf95wJ2k3DGpVAZMqJ7g1krwB WiePlkQx1Umwcp1XeTiea8z9J45XErr0lORhxT8GLEy815vM+zeFI+NdVIPWjHjzpEU/ S9nh96es6LX1xP+4/WEcGH3qn8soL9Nx8YMvTA8nImP/nNozNSZCh5acCxlk3Avt4SiK 53xg== X-Gm-Message-State: AJIora+G6qgpVRoqV+KFkp49aI4DNe5/vyRziKr63qKP5kvKix2yHu6A wepVeZsnwvSLqSkUOzM1ooykj8zBSuCKcaf/9+FY4HFlIFgyQA== X-Google-Smtp-Source: AGRyM1tkv57/zQtKbOusWQNMpwNtWp84QlqtbNooIrojAsJWRmyRO5x4CIYuxwXmqo/yIJQZhEy5mU7/sUUhQaWA2X4= X-Received: by 2002:a05:651c:1509:b0:25b:b4b6:f854 with SMTP id e9-20020a05651c150900b0025bb4b6f854mr23172486ljf.447.1657096199856; Wed, 06 Jul 2022 01:29:59 -0700 (PDT) List-Id: Performance/tuning List-Archive: https://lists.freebsd.org/archives/freebsd-performance List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-performance@freebsd.org X-BeenThere: freebsd-performance@freebsd.org MIME-Version: 1.0 From: Turritopsis Dohrnii Teo En Ming Date: Wed, 6 Jul 2022 16:29:48 +0800 Message-ID: Subject: FreeBSD is a great operating system! 
To: freebsd-performance@freebsd.org Cc: ceo@teo-en-ming-corp.com Content-Type: text/plain; charset="UTF-8" X-Rspamd-Queue-Id: 4LdCQd1FVzz4mt8 X-Spamd-Bar: ---- Authentication-Results: mx1.freebsd.org; dkim=pass header.d=gmail.com header.s=20210112 header.b=SYq4u1Jr; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (mx1.freebsd.org: domain of tdtemccna@gmail.com designates 2a00:1450:4864:20::22a as permitted sender) smtp.mailfrom=tdtemccna@gmail.com X-Spamd-Result: default: False [-4.00 / 15.00]; R_SPF_ALLOW(-0.20)[+ip6:2a00:1450:4000::/36:c]; FREEMAIL_FROM(0.00)[gmail.com]; TO_DN_NONE(0.00)[]; MID_RHS_MATCH_FROMTLD(0.00)[]; DKIM_TRACE(0.00)[gmail.com:+]; RCPT_COUNT_TWO(0.00)[2]; DMARC_POLICY_ALLOW(-0.50)[gmail.com,none]; NEURAL_HAM_SHORT(-1.00)[-1.000]; SUBJECT_ENDS_EXCLAIM(0.00)[]; FROM_EQ_ENVFROM(0.00)[]; MIME_TRACE(0.00)[0:+]; FREEMAIL_ENVFROM(0.00)[gmail.com]; ASN(0.00)[asn:15169, ipnet:2a00:1450::/32, country:US]; DWL_DNSWL_NONE(0.00)[gmail.com:dkim]; ARC_NA(0.00)[]; NEURAL_HAM_MEDIUM(-1.00)[-1.000]; R_DKIM_ALLOW(-0.20)[gmail.com:s=20210112]; FROM_HAS_DN(0.00)[]; NEURAL_HAM_LONG(-1.00)[-1.000]; MIME_GOOD(-0.10)[text/plain]; PREVIOUSLY_DELIVERED(0.00)[freebsd-performance@freebsd.org]; TO_MATCH_ENVRCPT_SOME(0.00)[]; RCVD_IN_DNSWL_NONE(0.00)[2a00:1450:4864:20::22a:from]; MLMMJ_DEST(0.00)[freebsd-performance]; RCVD_COUNT_TWO(0.00)[2]; RCVD_TLS_ALL(0.00)[] X-ThisMailContainsUnwantedMimeParts: N Subject: FreeBSD is a great operating system! Good day from Singapore, I think FreeBSD is a great operating system! I support FreeBSD because the most popular pfSense firewall, the extremely popular OPNsense firewall and the BSD Router Project are all powered by FreeBSD! macOS is also based on FreeBSD! I use pfSense community edition firewall in my home. I am planning to try out OPNsense firewall next. I will continue to support FreeBSD! It is a great operating system! FreeBSD is a very good network operating system. Regards, Mr. 
Turritopsis Dohrnii Teo En Ming Targeted Individual in Singapore 6 July 2022 Wed Blogs: https://tdtemcerts.blogspot.com https://tdtemcerts.wordpress.com From nobody Tue Oct 18 19:16:00 2022 X-Original-To: freebsd-performance@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 4MsNrM2kwCz4fGsc for ; Tue, 18 Oct 2022 19:16:19 +0000 (UTC) (envelope-from mike@sentex.net) Received: from smarthost1.sentex.ca (smarthost1.sentex.ca [IPv6:2607:f3e0:0:1::12]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "smarthost1.sentex.ca", Issuer "R3" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4MsNrL0yvDz3nl4; Tue, 18 Oct 2022 19:16:18 +0000 (UTC) (envelope-from mike@sentex.net) Received: from pyroxene2a.sentex.ca (pyroxene19.sentex.ca [199.212.134.19]) by smarthost1.sentex.ca (8.16.1/8.16.1) with ESMTPS id 29IJG0cx083114 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=FAIL); Tue, 18 Oct 2022 15:16:01 -0400 (EDT) (envelope-from mike@sentex.net) Received: from [IPV6:2607:f3e0:0:4:5132:67e3:2e60:df9c] ([IPv6:2607:f3e0:0:4:5132:67e3:2e60:df9c]) by pyroxene2a.sentex.ca (8.16.1/8.15.2) with ESMTPS id 29IJG0aT074463 (version=TLSv1.3 cipher=TLS_AES_128_GCM_SHA256 bits=128 verify=NO); Tue, 18 Oct 2022 15:16:00 -0400 (EDT) (envelope-from mike@sentex.net) Message-ID: <7b86e3fe-62e4-7b3e-f4bf-30e4894db9db@sentex.net> Date: Tue, 18 Oct 2022 15:16:00 -0400 List-Id: Performance/tuning List-Archive: https://lists.freebsd.org/archives/freebsd-performance List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-performance@freebsd.org X-BeenThere: freebsd-performance@freebsd.org MIME-Version: 1.0 User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Thunderbird/102.3.3 Content-Language: en-US To: Freebsd performance From: mike tancsa Subject: Chelsio Forwarding performance and RELENG_13 vs RELENG_12 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Scanned-By: MIMEDefang 2.84 on 64.7.153.18 X-Rspamd-Queue-Id: 4MsNrL0yvDz3nl4 X-Spamd-Bar: --- Authentication-Results: mx1.freebsd.org; dkim=none; dmarc=none; spf=pass (mx1.freebsd.org: domain of mike@sentex.net designates 2607:f3e0:0:1::12 as permitted sender) smtp.mailfrom=mike@sentex.net X-Spamd-Result: default: False [-3.40 / 15.00]; NEURAL_HAM_MEDIUM(-1.00)[-1.000]; NEURAL_HAM_LONG(-1.00)[-1.000]; NEURAL_HAM_SHORT(-1.00)[-0.999]; R_SPF_ALLOW(-0.20)[+ip6:2607:f3e0::/32]; RCVD_IN_DNSWL_LOW(-0.10)[199.212.134.19:received]; MIME_GOOD(-0.10)[text/plain]; FROM_EQ_ENVFROM(0.00)[]; RCVD_TLS_ALL(0.00)[]; R_DKIM_NA(0.00)[]; MLMMJ_DEST(0.00)[freebsd-performance@freebsd.org]; ASN(0.00)[asn:11647, ipnet:2607:f3e0::/32, country:CA]; MIME_TRACE(0.00)[0:+]; RCVD_COUNT_THREE(0.00)[3]; MID_RHS_MATCH_FROM(0.00)[]; FROM_HAS_DN(0.00)[]; FREEFALL_USER(0.00)[mike]; TO_DN_ALL(0.00)[]; TO_MATCH_ENVRCPT_ALL(0.00)[]; RCPT_COUNT_ONE(0.00)[1]; DMARC_NA(0.00)[sentex.net]; ARC_NA(0.00)[] X-ThisMailContainsUnwantedMimeParts: N I updated a RELENG_12 router along with the hardware to RELENG_13 (oct 14th kernel) and was surprised to see an increase in dev.cxl.0.stats.rx_ovflow0 at a somewhat faster rate than I was seeing on the older slightly slower hardware under about the same load.  (Xeon(R) E-2226G CPU @ 3.40GHz) vs a 4 core Xeon same freq, same memory speed. 
About 150Kpps in and out and a 1Gb/s throughput loader.conf is the same hw.cxgbe.toecaps_allowed="0" hw.cxgbe.rdmacaps_allowed="0" hw.cxgbe.iscsicaps_allowed="0" hw.cxgbe.fcoecaps_allowed="0" hw.cxgbe.pause_settings="0" hw.cxgbe.attack_filter="1" hw.cxgbe.drop_pkts_with_l3_errors="1" As there is a large routing table, I do have [fib_algo] inet.0 (radix4_lockless#46) rebuild_fd_flm: switching algo to radix4 [fib_algo] inet6.0 (radix6_lockless#58) rebuild_fd_flm: switching algo to radix6 kicking in. and sysctl.conf net.route.multipath=0 net.inet.ip.redirect=0 net.inet6.ip6.redirect=0 kern.ipc.maxsockbuf=16777216 net.inet.tcp.blackhole=1 Are there any other tweaks that can be done in order to better forwarding performance ? I do see at bootup time cxl0: nrxq (6), hw RSS table size (128); expect uneven traffic distribution. cxl1: nrxq (6), hw RSS table size (128); expect uneven traffic distribution. cxl3: nrxq (6), hw RSS table size (128); expect uneven traffic distribution. The cpu is 6 core. No HT enabled real memory  = 34359738368 (32768 MB) avail memory = 33238708224 (31698 MB) Event timer "LAPIC" quality 600 ACPI APIC Table: < > FreeBSD/SMP: Multiprocessor System Detected: 6 CPUs FreeBSD/SMP: 1 package(s) x 6 core(s) random: registering fast source Intel Secure Key RNG just a handful of ipfw rules (no states) that were the same as before and a dozen or so cxgbe firewall rules in the NIC Anything I can try / look at that might be causing the odd overflow on cxl0 ? Its a T540-CR with 3 ports in use. t5nex0@pci0:2:0:4:      class=0x020000 rev=0x00 hdr=0x00 vendor=0x1425 device=0x5403 subvendor=0x1425 subdevice=0x0000     vendor     = 'Chelsio Communications Inc'     device     = 'T540-CR Unified Wire Ethernet Controller'     class      = network     subclass   = ethernet     bar   [10] = type Memory, range 64, base 0x91300000, size 524288, enabled     bar   [18] = type Memory, range 64, base 0x90000000, size 16777216, enabled     bar   [20] = type Memory, range 64, base 0x91984000, size 8192, enabled     cap 01[40] = powerspec 3  supports D0 D3  current D0     cap 05[50] = MSI supports 32 messages, 64 bit, vector masks     cap 10[70] = PCI-Express 2 endpoint max data 256(2048) FLR                  max read 4096                  link x8(x8) speed 8.0(8.0) ASPM L0s/L1(L0s/L1)     cap 11[b0] = MSI-X supports 128 messages, enabled                  Table in map 0x20[0x0], PBA in map 0x20[0x1000]     cap 03[d0] = VPD     ecap 0001[100] = AER 2 0 fatal 0 non-fatal 5 corrected     ecap 0003[170] = Serial 1 0000000000000000     ecap 000e[190] = ARI 1     ecap 0019[1a0] = PCIe Sec 1 lane errors 0     ecap 0010[1c0] = SR-IOV 1 IOV disabled, Memory Space disabled, ARI disabled                      0 VFs configured out of 0 supported                      First VF RID Offset 0x0008, VF RID Stride 0x0004                      VF Device ID 0x5803                      Page Sizes: 4096 (enabled), 8192, 65536, 262144, 1048576, 4194304     ecap 0017[200] = TPH Requester 1 Thanks for any suggestions     ---Mike From nobody Fri Oct 21 17:31:36 2022 X-Original-To: freebsd-performance@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 4MvBNF3G49z4gGyS for ; Fri, 21 Oct 2022 17:31:41 +0000 (UTC) (envelope-from nparhar@gmail.com) Received: from mail-pj1-x1029.google.com (mail-pj1-x1029.google.com [IPv6:2607:f8b0:4864:20::1029]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) key-exchange X25519 
server-signature RSA-PSS (4096 bits) server-digest SHA256 client-signature RSA-PSS (2048 bits) client-digest SHA256) (Client CN "smtp.gmail.com", Issuer "GTS CA 1D4" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4MvBND0DlPz47T5 for ; Fri, 21 Oct 2022 17:31:40 +0000 (UTC) (envelope-from nparhar@gmail.com) Received: by mail-pj1-x1029.google.com with SMTP id ez6so3031567pjb.1 for ; Fri, 21 Oct 2022 10:31:40 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=content-transfer-encoding:in-reply-to:from:references:to :content-language:subject:user-agent:mime-version:date:message-id :from:to:cc:subject:date:message-id:reply-to; bh=Jyc73eCl8GweAqmQtRLzwAEUayVcC35JhOpQ6yuzTVg=; b=j2RCn7/YNwdxAYWffq4TfR74AZr/3TzMGPhK84MvQR4yQ/a5IR+PwPQnZKOHub5NRu WR118DEMlnwOXTLy1EJyvV7E7Mrz9we4pK6drL3X1jA4BcdRbICAaMmbGfhMoHHZEX9F VL6x0QYs0pZJRXNay20mPp7i75YFW80iXIUCFMuPKueHPknGWOvsVQG+lsx6phdlamR/ PTuqBIj0cWFTyyNKPFCNzgdI/GKwf0ctGWOPYXX11vuqTMsuHbgH3KfZRJLdk0lOJE57 ge1nmNwDqYYKqLIQ4Z/1yKtI5j7B4uoqUmUsMGnk1XM+qh7YdJPiECxOCwHoaNp1BdrZ MF7w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:in-reply-to:from:references:to :content-language:subject:user-agent:mime-version:date:message-id :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=Jyc73eCl8GweAqmQtRLzwAEUayVcC35JhOpQ6yuzTVg=; b=LdW8ojnmFstT2cEPKGGxKBEc4nTeuViGN4Hq2gGNFhYMf2mDeR5IVCGMUZi+BSxGW2 M2i2uxBhvcibfn5jWMxCf61qUOASZuTRiJb+7nsjQnXTeVl+cTPN+Pcg7YrTwMji7sKj jadrvNgNQeL36pO4JSbOok52JaooHn9MYyIJVbZcwQOgzd9VMCohn0TxU69NpsPHUVEG Nk9/3D6PvszA1C/XFMi8VgxQcWHAgrlgbwPr6/gnkX+LzrV12+NiY7ZVPv+X2GX9AKq9 YC4nmS/TUINUVOfCdGPl28ANT4MBkEZIKmTTLWfeAmKdCKg3tF6djhgJVueDSAx5KViS 7NEw== X-Gm-Message-State: ACrzQf2h10M5rdV4px0WUlStEa5B8DJ+v8BCnDKvJYUQnhmWxyLN35+b TZn1jmaw4zpoy3yxgoM9LfNesrakdiw= X-Google-Smtp-Source: AMsMyM4Okgr2yCYff7aX8BtXjAEae7p1QrZu2eUiRPsNd0RtaWg8BjKKc+nBKiGuh+7ZhGCMpEXrQA== X-Received: by 2002:a17:90b:1c8c:b0:203:89fb:ba79 with SMTP id oo12-20020a17090b1c8c00b0020389fbba79mr59114145pjb.92.1666373498504; Fri, 21 Oct 2022 10:31:38 -0700 (PDT) Received: from [10.192.161.10] (stargate.chelsio.com. 
[12.32.117.8]) by smtp.googlemail.com with ESMTPSA id a188-20020a6366c5000000b00460ea630c1bsm13661433pgc.46.2022.10.21.10.31.37 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Fri, 21 Oct 2022 10:31:38 -0700 (PDT) Message-ID: <92cdf4b8-2209-ec44-8151-a59b9e8f1504@gmail.com> Date: Fri, 21 Oct 2022 10:31:36 -0700 List-Id: Performance/tuning List-Archive: https://lists.freebsd.org/archives/freebsd-performance List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-performance@freebsd.org X-BeenThere: freebsd-performance@freebsd.org MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:102.0) Gecko/20100101 Thunderbird/102.3.2 Subject: Re: Chelsio Forwarding performance and RELENG_13 vs RELENG_12 Content-Language: en-US To: mike tancsa , Freebsd performance References: <7b86e3fe-62e4-7b3e-f4bf-30e4894db9db@sentex.net> From: Navdeep Parhar In-Reply-To: <7b86e3fe-62e4-7b3e-f4bf-30e4894db9db@sentex.net> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-Rspamd-Queue-Id: 4MvBND0DlPz47T5 X-Spamd-Bar: --- Authentication-Results: mx1.freebsd.org; dkim=pass header.d=gmail.com header.s=20210112 header.b="j2RCn7/Y"; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (mx1.freebsd.org: domain of nparhar@gmail.com designates 2607:f8b0:4864:20::1029 as permitted sender) smtp.mailfrom=nparhar@gmail.com X-Spamd-Result: default: False [-4.00 / 15.00]; NEURAL_HAM_LONG(-1.00)[-1.000]; NEURAL_HAM_SHORT(-1.00)[-1.000]; NEURAL_HAM_MEDIUM(-1.00)[-1.000]; DMARC_POLICY_ALLOW(-0.50)[gmail.com,none]; R_SPF_ALLOW(-0.20)[+ip6:2607:f8b0:4000::/36:c]; R_DKIM_ALLOW(-0.20)[gmail.com:s=20210112]; MIME_GOOD(-0.10)[text/plain]; ARC_NA(0.00)[]; RCVD_VIA_SMTP_AUTH(0.00)[]; FROM_HAS_DN(0.00)[]; PREVIOUSLY_DELIVERED(0.00)[freebsd-performance@freebsd.org]; DWL_DNSWL_NONE(0.00)[gmail.com:dkim]; TO_MATCH_ENVRCPT_SOME(0.00)[]; MID_RHS_MATCH_FROM(0.00)[]; ASN(0.00)[asn:15169, ipnet:2607:f8b0::/32, country:US]; RCVD_IN_DNSWL_NONE(0.00)[2607:f8b0:4864:20::1029:from]; TO_DN_ALL(0.00)[]; RCVD_COUNT_THREE(0.00)[3]; FREEMAIL_FROM(0.00)[gmail.com]; RCVD_TLS_LAST(0.00)[]; DKIM_TRACE(0.00)[gmail.com:+]; FROM_EQ_ENVFROM(0.00)[]; RCPT_COUNT_TWO(0.00)[2]; FREEMAIL_ENVFROM(0.00)[gmail.com]; MIME_TRACE(0.00)[0:+]; MLMMJ_DEST(0.00)[freebsd-performance@freebsd.org] X-ThisMailContainsUnwantedMimeParts: N On 10/18/22 12:16 PM, mike tancsa wrote: > I updated a RELENG_12 router along with the hardware to RELENG_13 (oct > 14th kernel) and was surprised to see an increase in > dev.cxl.0.stats.rx_ovflow0 at a somewhat faster rate than I was seeing > on the older slightly slower hardware under about the same load. > (Xeon(R) E-2226G CPU @ 3.40GHz) vs a 4 core Xeon same freq, same memory > speed. About 150Kpps in and out and a 1Gb/s throughput > > loader.conf is the same > > > hw.cxgbe.toecaps_allowed="0" > hw.cxgbe.rdmacaps_allowed="0" > hw.cxgbe.iscsicaps_allowed="0" > hw.cxgbe.fcoecaps_allowed="0" > hw.cxgbe.pause_settings="0" > hw.cxgbe.attack_filter="1" > hw.cxgbe.drop_pkts_with_l3_errors="1" > > As there is a large routing table, I do have > > [fib_algo] inet.0 (radix4_lockless#46) rebuild_fd_flm: switching algo to > radix4 > [fib_algo] inet6.0 (radix6_lockless#58) rebuild_fd_flm: switching algo > to radix6 > > kicking in. 
> > and sysctl.conf > > net.route.multipath=0 > > net.inet.ip.redirect=0 > net.inet6.ip6.redirect=0 > kern.ipc.maxsockbuf=16777216 > net.inet.tcp.blackhole=1 > > Are there any other tweaks that can be done in order to better > forwarding performance ? I do see at bootup time > > cxl0: nrxq (6), hw RSS table size (128); expect uneven traffic > distribution. > cxl1: nrxq (6), hw RSS table size (128); expect uneven traffic > distribution. > cxl3: nrxq (6), hw RSS table size (128); expect uneven traffic > distribution. > > The cpu is 6 core. No HT enabled The old system was 4-core so it must have used 4 queues. Can you please try that on the new system and see how it does? hw.cxgbe.ntxq=4 hw.cxgbe.nrxq=4 Regards, Navdeep From nobody Fri Oct 21 17:57:22 2022 X-Original-To: freebsd-performance@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 4MvBy25SVzz4gLyM for ; Fri, 21 Oct 2022 17:57:30 +0000 (UTC) (envelope-from mike@sentex.net) Received: from smarthost1.sentex.ca (smarthost1.sentex.ca [IPv6:2607:f3e0:0:1::12]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "smarthost1.sentex.ca", Issuer "R3" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4MvBy151yxz3CLn for ; Fri, 21 Oct 2022 17:57:29 +0000 (UTC) (envelope-from mike@sentex.net) Received: from pyroxene2a.sentex.ca (pyroxene19.sentex.ca [199.212.134.19]) by smarthost1.sentex.ca (8.16.1/8.16.1) with ESMTPS id 29LHvL9w095168 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=FAIL); Fri, 21 Oct 2022 13:57:22 -0400 (EDT) (envelope-from mike@sentex.net) Received: from [IPV6:2607:f3e0:0:4:f808:cefc:42f3:2221] ([IPv6:2607:f3e0:0:4:f808:cefc:42f3:2221]) by pyroxene2a.sentex.ca (8.16.1/8.15.2) with ESMTPS id 29LHvLhx012064 (version=TLSv1.3 cipher=TLS_AES_128_GCM_SHA256 bits=128 verify=NO); Fri, 21 Oct 2022 13:57:21 -0400 (EDT) (envelope-from mike@sentex.net) Message-ID: Date: Fri, 21 Oct 2022 13:57:22 -0400 List-Id: Performance/tuning List-Archive: https://lists.freebsd.org/archives/freebsd-performance List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-performance@freebsd.org X-BeenThere: freebsd-performance@freebsd.org MIME-Version: 1.0 User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Thunderbird/102.3.3 Subject: Re: Chelsio Forwarding performance and RELENG_13 vs RELENG_12 Content-Language: en-US To: Navdeep Parhar , Freebsd performance References: <7b86e3fe-62e4-7b3e-f4bf-30e4894db9db@sentex.net> <92cdf4b8-2209-ec44-8151-a59b9e8f1504@gmail.com> From: mike tancsa In-Reply-To: <92cdf4b8-2209-ec44-8151-a59b9e8f1504@gmail.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Scanned-By: MIMEDefang 2.84 on 64.7.153.18 X-Rspamd-Queue-Id: 4MvBy151yxz3CLn X-Spamd-Bar: --- Authentication-Results: mx1.freebsd.org; dkim=none; dmarc=none; spf=pass (mx1.freebsd.org: domain of mike@sentex.net designates 2607:f3e0:0:1::12 as permitted sender) smtp.mailfrom=mike@sentex.net X-Spamd-Result: default: False [-3.40 / 15.00]; NEURAL_HAM_LONG(-1.00)[-1.000]; NEURAL_HAM_MEDIUM(-1.00)[-1.000]; NEURAL_HAM_SHORT(-1.00)[-1.000]; R_SPF_ALLOW(-0.20)[+ip6:2607:f3e0::/32]; RCVD_IN_DNSWL_LOW(-0.10)[199.212.134.19:received]; MIME_GOOD(-0.10)[text/plain]; RCVD_TLS_ALL(0.00)[]; FREEMAIL_TO(0.00)[gmail.com,freebsd.org]; ASN(0.00)[asn:11647, ipnet:2607:f3e0::/32, country:CA]; FROM_EQ_ENVFROM(0.00)[]; MIME_TRACE(0.00)[0:+]; 
R_DKIM_NA(0.00)[]; MLMMJ_DEST(0.00)[freebsd-performance@freebsd.org]; TO_DN_ALL(0.00)[]; MID_RHS_MATCH_FROM(0.00)[]; FROM_HAS_DN(0.00)[]; FREEFALL_USER(0.00)[mike]; RCPT_COUNT_TWO(0.00)[2]; DMARC_NA(0.00)[sentex.net]; TO_MATCH_ENVRCPT_SOME(0.00)[]; RCVD_COUNT_THREE(0.00)[3]; ARC_NA(0.00)[] X-ThisMailContainsUnwantedMimeParts: N On 10/21/2022 1:31 PM, Navdeep Parhar wrote: > On 10/18/22 12:16 PM, mike tancsa wrote: >> I updated a RELENG_12 router along with the hardware to RELENG_13 >> (oct 14th kernel) and was surprised to see an increase in >> dev.cxl.0.stats.rx_ovflow0 at a somewhat faster rate than I was >> seeing on the older slightly slower hardware under about the same >> load. (Xeon(R) E-2226G CPU @ 3.40GHz) vs a 4 core Xeon same freq, >> same memory speed. About 150Kpps in and out and a 1Gb/s throughput >> >> loader.conf is the same >> >> >> hw.cxgbe.toecaps_allowed="0" >> hw.cxgbe.rdmacaps_allowed="0" >> hw.cxgbe.iscsicaps_allowed="0" >> hw.cxgbe.fcoecaps_allowed="0" >> hw.cxgbe.pause_settings="0" >> hw.cxgbe.attack_filter="1" >> hw.cxgbe.drop_pkts_with_l3_errors="1" >> >> As there is a large routing table, I do have >> >> [fib_algo] inet.0 (radix4_lockless#46) rebuild_fd_flm: switching algo >> to radix4 >> [fib_algo] inet6.0 (radix6_lockless#58) rebuild_fd_flm: switching >> algo to radix6 >> >> kicking in. >> >> and sysctl.conf >> >> net.route.multipath=0 >> >> net.inet.ip.redirect=0 >> net.inet6.ip6.redirect=0 >> kern.ipc.maxsockbuf=16777216 >> net.inet.tcp.blackhole=1 >> >> Are there any other tweaks that can be done in order to better >> forwarding performance ? I do see at bootup time >> >> cxl0: nrxq (6), hw RSS table size (128); expect uneven traffic >> distribution. >> cxl1: nrxq (6), hw RSS table size (128); expect uneven traffic >> distribution. >> cxl3: nrxq (6), hw RSS table size (128); expect uneven traffic >> distribution. >> >> The cpu is 6 core. No HT enabled > > The old system was 4-core so it must have used 4 queues.  Can you > please try that on the new system and see how it does? > > hw.cxgbe.ntxq=4 > hw.cxgbe.nrxq=4 > Thanks Navdeep! Unfortunately, still the odd dropped packet :( dev.cxl.0.stats.rx_ovflow0: 78 Since my initial post, I did try with 16, and it did not seem to impact the rate of overflows either. 
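One way to see whether the remaining overflows line up with short traffic bursts rather than steady load is to sample the counters from the dump below once a second and compare against pps/bandwidth graphs. This is only an illustrative sketch, not something suggested by the posters; it assumes port 0 (cxl0) and plain /bin/sh:

#!/bin/sh
# Sample the cxl0 FIFO-overflow and truncation counters once per second
# so that any jump can be lined up against pps/throughput graphs.
while :; do
    printf '%s rx_ovflow0=%s rx_trunc0=%s\n' \
        "$(date '+%H:%M:%S')" \
        "$(sysctl -n dev.cxl.0.stats.rx_ovflow0)" \
        "$(sysctl -n dev.cxl.0.stats.rx_trunc0)"
    sleep 1
done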
dev.cxl.0.stats.rx_trunc3: 0 dev.cxl.0.stats.rx_trunc2: 0 dev.cxl.0.stats.rx_trunc1: 0 dev.cxl.0.stats.rx_trunc0: 1 dev.cxl.0.stats.rx_ovflow3: 0 dev.cxl.0.stats.rx_ovflow2: 0 dev.cxl.0.stats.rx_ovflow1: 0 dev.cxl.0.stats.rx_ovflow0: 78 dev.cxl.0.stats.rx_ppp7: 0 dev.cxl.0.stats.rx_ppp6: 0 dev.cxl.0.stats.rx_ppp5: 0 dev.cxl.0.stats.rx_ppp4: 0 dev.cxl.0.stats.rx_ppp3: 0 dev.cxl.0.stats.rx_ppp2: 0 dev.cxl.0.stats.rx_ppp1: 0 dev.cxl.0.stats.rx_ppp0: 0 dev.cxl.0.stats.rx_pause: 0 dev.cxl.0.stats.rx_frames_1519_max: 0 dev.cxl.0.stats.rx_frames_1024_1518: 20724413 dev.cxl.0.stats.rx_frames_512_1023: 1371427 dev.cxl.0.stats.rx_frames_256_511: 1515522 dev.cxl.0.stats.rx_frames_128_255: 2371419 dev.cxl.0.stats.rx_frames_65_127: 3386302 dev.cxl.0.stats.rx_frames_64: 2882681 dev.cxl.0.stats.rx_runt: 0 dev.cxl.0.stats.rx_symbol_err: 0 dev.cxl.0.stats.rx_len_err: 0 dev.cxl.0.stats.rx_fcs_err: 0 dev.cxl.0.stats.rx_jabber: 0 dev.cxl.0.stats.rx_too_long: 0 dev.cxl.0.stats.rx_ucast_frames: 32180697 dev.cxl.0.stats.rx_mcast_frames: 30513 dev.cxl.0.stats.rx_bcast_frames: 40594 dev.cxl.0.stats.rx_frames: 32251814 dev.cxl.0.stats.rx_octets: 31681379196 dev.cxl.0.stats.tx_ppp7: 0 dev.cxl.0.stats.tx_ppp6: 0 dev.cxl.0.stats.tx_ppp5: 0 dev.cxl.0.stats.tx_ppp4: 0 dev.cxl.0.stats.tx_ppp3: 0 dev.cxl.0.stats.tx_ppp2: 0 dev.cxl.0.stats.tx_ppp1: 0 dev.cxl.0.stats.tx_ppp0: 0 dev.cxl.0.stats.tx_pause: 0 dev.cxl.0.stats.tx_drop: 0 dev.cxl.0.stats.tx_frames_1519_max: 0 dev.cxl.0.stats.tx_frames_1024_1518: 5369922 dev.cxl.0.stats.tx_frames_512_1023: 736985 dev.cxl.0.stats.tx_frames_256_511: 842554 dev.cxl.0.stats.tx_frames_128_255: 2708331 dev.cxl.0.stats.tx_frames_65_127: 6254876 dev.cxl.0.stats.tx_frames_64: 2076101 dev.cxl.0.stats.tx_error_frames: 0 dev.cxl.0.stats.tx_ucast_frames: 17988628 dev.cxl.0.stats.tx_mcast_frames: 17 dev.cxl.0.stats.tx_bcast_frames: 130 dev.cxl.0.stats.tx_frames: 17988780 dev.cxl.0.stats.tx_octets: 9797845532 dev.cxl.0.stats.tx_parse_error: 0 dev.cxl.0.tc.14.params: uninitialized dev.cxl.0.tc.14.refcount: 0 dev.cxl.0.tc.14.flags: 0 dev.cxl.0.tc.14.state: 0 dev.cxl.0.tc.13.params: uninitialized dev.cxl.0.tc.13.refcount: 0 dev.cxl.0.tc.13.flags: 0 dev.cxl.0.tc.13.state: 0 dev.cxl.0.tc.12.params: uninitialized dev.cxl.0.tc.12.refcount: 0 dev.cxl.0.tc.12.flags: 0 dev.cxl.0.tc.12.state: 0 dev.cxl.0.tc.11.params: uninitialized dev.cxl.0.tc.11.refcount: 0 dev.cxl.0.tc.11.flags: 0 dev.cxl.0.tc.11.state: 0 dev.cxl.0.tc.10.params: uninitialized dev.cxl.0.tc.10.refcount: 0 dev.cxl.0.tc.10.flags: 0 dev.cxl.0.tc.10.state: 0 dev.cxl.0.tc.9.params: uninitialized dev.cxl.0.tc.9.refcount: 0 dev.cxl.0.tc.9.flags: 0 dev.cxl.0.tc.9.state: 0 dev.cxl.0.tc.8.params: uninitialized dev.cxl.0.tc.8.refcount: 0 dev.cxl.0.tc.8.flags: 0 dev.cxl.0.tc.8.state: 0 dev.cxl.0.tc.7.params: uninitialized dev.cxl.0.tc.7.refcount: 0 dev.cxl.0.tc.7.flags: 0 dev.cxl.0.tc.7.state: 0 dev.cxl.0.tc.6.params: uninitialized dev.cxl.0.tc.6.refcount: 0 dev.cxl.0.tc.6.flags: 0 dev.cxl.0.tc.6.state: 0 dev.cxl.0.tc.5.params: uninitialized dev.cxl.0.tc.5.refcount: 0 dev.cxl.0.tc.5.flags: 0 dev.cxl.0.tc.5.state: 0 dev.cxl.0.tc.4.params: uninitialized dev.cxl.0.tc.4.refcount: 0 dev.cxl.0.tc.4.flags: 0 dev.cxl.0.tc.4.state: 0 dev.cxl.0.tc.3.params: uninitialized dev.cxl.0.tc.3.refcount: 0 dev.cxl.0.tc.3.flags: 0 dev.cxl.0.tc.3.state: 0 dev.cxl.0.tc.2.params: uninitialized dev.cxl.0.tc.2.refcount: 0 dev.cxl.0.tc.2.flags: 0 dev.cxl.0.tc.2.state: 0 dev.cxl.0.tc.1.params: uninitialized dev.cxl.0.tc.1.refcount: 0 dev.cxl.0.tc.1.flags: 0 
dev.cxl.0.tc.1.state: 0 dev.cxl.0.tc.0.params: uninitialized dev.cxl.0.tc.0.refcount: 0 dev.cxl.0.tc.0.flags: 0 dev.cxl.0.tc.0.state: 0 dev.cxl.0.tc.burstsize: 0 dev.cxl.0.tc.pktsize: 0 dev.cxl.0.rx_c_chan: 0 dev.cxl.0.rx_e_chan_map: 1 dev.cxl.0.mps_bg_map: 1 dev.cxl.0.max_speed: 10 dev.cxl.0.lpacaps: 458752 dev.cxl.0.acaps: 268435460 dev.cxl.0.pcaps: 269418502 dev.cxl.0.rcaps: 270532612 dev.cxl.0.force_fec: -1 dev.cxl.0.autoneg: -1 dev.cxl.0.module_fec: n/a dev.cxl.0.requested_fec: 20 dev.cxl.0.link_fec: 4 dev.cxl.0.pause_settings: 0 dev.cxl.0.linkdnrc: n/a dev.cxl.0.qsize_txq: 1024 dev.cxl.0.qsize_rxq: 1024 dev.cxl.0.holdoff_pktc_idx: -1 dev.cxl.0.holdoff_tmr_idx: 1 dev.cxl.0.tx_vm_wr: 0 dev.cxl.0.rsrv_noflowq: 0 dev.cxl.0.rss_size: 128 dev.cxl.0.rss_base: 0 dev.cxl.0.first_txq: 0 dev.cxl.0.first_rxq: 0 dev.cxl.0.ntxq: 4 dev.cxl.0.nrxq: 4 dev.cxl.0.viid: 2372 dev.cxl.0.txq.3.vxlan_txcsum: 0 dev.cxl.0.txq.3.vxlan_tso_wrs: 0 dev.cxl.0.txq.3.raw_wrs: 0 dev.cxl.0.txq.3.txpkts_flush: 24 dev.cxl.0.txq.3.txpkts1_pkts: 919 dev.cxl.0.txq.3.txpkts0_pkts: 1881 dev.cxl.0.txq.3.txpkts1_wrs: 154 dev.cxl.0.txq.3.txpkts0_wrs: 246 dev.cxl.0.txq.3.txpkt_wrs: 4638655 dev.cxl.0.txq.3.sgl_wrs: 2407744 dev.cxl.0.txq.3.imm_wrs: 2231313 dev.cxl.0.txq.3.tso_wrs: 0 dev.cxl.0.txq.3.vlan_insertion: 0 dev.cxl.0.txq.3.txcsum: 662 dev.cxl.0.txq.3.tc: -1 dev.cxl.0.txq.3.mp_ring.cons_idle2: 378 dev.cxl.0.txq.3.mp_ring.cons_idle: 4631431 dev.cxl.0.txq.3.mp_ring.stalls: 0 dev.cxl.0.txq.3.mp_ring.abdications: 0 dev.cxl.0.txq.3.mp_ring.not_consumer: 9659 dev.cxl.0.txq.3.mp_ring.takeovers: 0 dev.cxl.0.txq.3.mp_ring.consumer3: 376 dev.cxl.0.txq.3.mp_ring.consumer2: 2 dev.cxl.0.txq.3.mp_ring.fast_consumer: 4631421 dev.cxl.0.txq.3.mp_ring.consumed: 4641458 dev.cxl.0.txq.3.mp_ring.dropped: 0 dev.cxl.0.txq.3.mp_ring.state: 463863545964 dev.cxl.0.txq.3.sidx: 1023 dev.cxl.0.txq.3.pidx: 973 dev.cxl.0.txq.3.cidx: 951 dev.cxl.0.txq.3.cntxt_id: 27 dev.cxl.0.txq.3.abs_id: 27 dev.cxl.0.txq.3.dmalen: 65536 dev.cxl.0.txq.3.ba: 304218112 dev.cxl.0.txq.2.vxlan_txcsum: 0 dev.cxl.0.txq.2.vxlan_tso_wrs: 0 dev.cxl.0.txq.2.raw_wrs: 0 dev.cxl.0.txq.2.txpkts_flush: 24 dev.cxl.0.txq.2.txpkts1_pkts: 926 dev.cxl.0.txq.2.txpkts0_pkts: 1761 dev.cxl.0.txq.2.txpkts1_wrs: 129 dev.cxl.0.txq.2.txpkts0_wrs: 235 dev.cxl.0.txq.2.txpkt_wrs: 4646429 dev.cxl.0.txq.2.sgl_wrs: 3128918 dev.cxl.0.txq.2.imm_wrs: 1517875 dev.cxl.0.txq.2.tso_wrs: 0 dev.cxl.0.txq.2.vlan_insertion: 0 dev.cxl.0.txq.2.txcsum: 5925 dev.cxl.0.txq.2.tc: -1 dev.cxl.0.txq.2.mp_ring.cons_idle2: 458 dev.cxl.0.txq.2.mp_ring.cons_idle: 4639888 dev.cxl.0.txq.2.mp_ring.stalls: 0 dev.cxl.0.txq.2.mp_ring.abdications: 0 dev.cxl.0.txq.2.mp_ring.not_consumer: 8775 dev.cxl.0.txq.2.mp_ring.takeovers: 0 dev.cxl.0.txq.2.mp_ring.consumer3: 459 dev.cxl.0.txq.2.mp_ring.consumer2: 0 dev.cxl.0.txq.2.mp_ring.fast_consumer: 4639882 dev.cxl.0.txq.2.mp_ring.consumed: 4649116 dev.cxl.0.txq.2.mp_ring.dropped: 0 dev.cxl.0.txq.2.mp_ring.state: 2594199831132 dev.cxl.0.txq.2.sidx: 1023 dev.cxl.0.txq.2.pidx: 31 dev.cxl.0.txq.2.cidx: 26 dev.cxl.0.txq.2.cntxt_id: 26 dev.cxl.0.txq.2.abs_id: 26 dev.cxl.0.txq.2.dmalen: 65536 dev.cxl.0.txq.2.ba: 376111104 dev.cxl.0.txq.1.vxlan_txcsum: 0 dev.cxl.0.txq.1.vxlan_tso_wrs: 0 dev.cxl.0.txq.1.raw_wrs: 0 dev.cxl.0.txq.1.txpkts_flush: 14 dev.cxl.0.txq.1.txpkts1_pkts: 1260 dev.cxl.0.txq.1.txpkts0_pkts: 2118 dev.cxl.0.txq.1.txpkts1_wrs: 195 dev.cxl.0.txq.1.txpkts0_wrs: 266 dev.cxl.0.txq.1.txpkt_wrs: 4664613 dev.cxl.0.txq.1.sgl_wrs: 2593073 dev.cxl.0.txq.1.imm_wrs: 2072003 
dev.cxl.0.txq.1.tso_wrs: 0 dev.cxl.0.txq.1.vlan_insertion: 0 dev.cxl.0.txq.1.txcsum: 598 dev.cxl.0.txq.1.tc: -1 dev.cxl.0.txq.1.mp_ring.cons_idle2: 519 dev.cxl.0.txq.1.mp_ring.cons_idle: 4654682 dev.cxl.0.txq.1.mp_ring.stalls: 0 dev.cxl.0.txq.1.mp_ring.abdications: 0 dev.cxl.0.txq.1.mp_ring.not_consumer: 12800 dev.cxl.0.txq.1.mp_ring.takeovers: 0 dev.cxl.0.txq.1.mp_ring.consumer3: 516 dev.cxl.0.txq.1.mp_ring.consumer2: 3 dev.cxl.0.txq.1.mp_ring.fast_consumer: 4654681 dev.cxl.0.txq.1.mp_ring.consumed: 4668000 dev.cxl.0.txq.1.mp_ring.dropped: 0 dev.cxl.0.txq.1.mp_ring.state: 219046674483 dev.cxl.0.txq.1.sidx: 1023 dev.cxl.0.txq.1.pidx: 1001 dev.cxl.0.txq.1.cidx: 1000 dev.cxl.0.txq.1.cntxt_id: 25 dev.cxl.0.txq.1.abs_id: 25 dev.cxl.0.txq.1.dmalen: 65536 dev.cxl.0.txq.1.ba: 304152576 dev.cxl.0.txq.0.vxlan_txcsum: 0 dev.cxl.0.txq.0.vxlan_tso_wrs: 0 dev.cxl.0.txq.0.raw_wrs: 0 dev.cxl.0.txq.0.txpkts_flush: 21 dev.cxl.0.txq.0.txpkts1_pkts: 717 dev.cxl.0.txq.0.txpkts0_pkts: 1328 dev.cxl.0.txq.0.txpkts1_wrs: 121 dev.cxl.0.txq.0.txpkts0_wrs: 172 dev.cxl.0.txq.0.txpkt_wrs: 4028347 dev.cxl.0.txq.0.sgl_wrs: 2240846 dev.cxl.0.txq.0.imm_wrs: 1787797 dev.cxl.0.txq.0.tso_wrs: 0 dev.cxl.0.txq.0.vlan_insertion: 0 dev.cxl.0.txq.0.txcsum: 2122 dev.cxl.0.txq.0.tc: -1 dev.cxl.0.txq.0.mp_ring.cons_idle2: 401 dev.cxl.0.txq.0.mp_ring.cons_idle: 4023203 dev.cxl.0.txq.0.mp_ring.stalls: 0 dev.cxl.0.txq.0.mp_ring.abdications: 0 dev.cxl.0.txq.0.mp_ring.not_consumer: 6799 dev.cxl.0.txq.0.mp_ring.takeovers: 0 dev.cxl.0.txq.0.mp_ring.consumer3: 400 dev.cxl.0.txq.0.mp_ring.consumer2: 1 dev.cxl.0.txq.0.mp_ring.fast_consumer: 4023202 dev.cxl.0.txq.0.mp_ring.consumed: 4030402 dev.cxl.0.txq.0.mp_ring.dropped: 0 dev.cxl.0.txq.0.mp_ring.state: 3457501430565 dev.cxl.0.txq.0.sidx: 1023 dev.cxl.0.txq.0.pidx: 126 dev.cxl.0.txq.0.cidx: 118 dev.cxl.0.txq.0.cntxt_id: 24 dev.cxl.0.txq.0.abs_id: 24 dev.cxl.0.txq.0.dmalen: 65536 dev.cxl.0.txq.0.ba: 358457344 dev.cxl.0.rxq.3.vxlan_rxcsum: 0 dev.cxl.0.rxq.3.vlan_extraction: 0 dev.cxl.0.rxq.3.rxcsum: 8468401 dev.cxl.0.rxq.3.lro_flushed: 0 dev.cxl.0.rxq.3.lro_queued: 0 dev.cxl.0.rxq.3.cidx: 757 dev.cxl.0.rxq.3.cntxt_id: 12 dev.cxl.0.rxq.3.abs_id: 12 dev.cxl.0.rxq.3.dmalen: 65536 dev.cxl.0.rxq.3.ba: 358350848 dev.cxl.0.rxq.3.fl.cluster_fast_recycled: 831 dev.cxl.0.rxq.3.fl.cluster_recycled: 2780573 dev.cxl.0.rxq.3.fl.cluster_allocated: 219644 dev.cxl.0.rxq.3.fl.pidx: 800 dev.cxl.0.rxq.3.fl.rx_offset: 1792 dev.cxl.0.rxq.3.fl.cidx: 808 dev.cxl.0.rxq.3.fl.packing: 1 dev.cxl.0.rxq.3.fl.padding: 1 dev.cxl.0.rxq.3.fl.cntxt_id: 23 dev.cxl.0.rxq.3.fl.dmalen: 8192 dev.cxl.0.rxq.3.fl.ba: 358416384 dev.cxl.0.rxq.2.vxlan_rxcsum: 0 dev.cxl.0.rxq.2.vlan_extraction: 0 dev.cxl.0.rxq.2.rxcsum: 7525378 dev.cxl.0.rxq.2.lro_flushed: 0 dev.cxl.0.rxq.2.lro_queued: 0 dev.cxl.0.rxq.2.cidx: 956 dev.cxl.0.rxq.2.cntxt_id: 11 dev.cxl.0.rxq.2.abs_id: 11 dev.cxl.0.rxq.2.dmalen: 65536 dev.cxl.0.rxq.2.ba: 358260736 dev.cxl.0.rxq.2.fl.cluster_fast_recycled: 1606 dev.cxl.0.rxq.2.fl.cluster_recycled: 2356578 dev.cxl.0.rxq.2.fl.cluster_allocated: 203736 dev.cxl.0.rxq.2.fl.pidx: 584 dev.cxl.0.rxq.2.fl.rx_offset: 1536 dev.cxl.0.rxq.2.fl.cidx: 596 dev.cxl.0.rxq.2.fl.packing: 1 dev.cxl.0.rxq.2.fl.padding: 1 dev.cxl.0.rxq.2.fl.cntxt_id: 22 dev.cxl.0.rxq.2.fl.dmalen: 8192 dev.cxl.0.rxq.2.fl.ba: 358326272 dev.cxl.0.rxq.1.vxlan_rxcsum: 0 dev.cxl.0.rxq.1.vlan_extraction: 0 dev.cxl.0.rxq.1.rxcsum: 8143582 dev.cxl.0.rxq.1.lro_flushed: 0 dev.cxl.0.rxq.1.lro_queued: 0 dev.cxl.0.rxq.1.cidx: 318 dev.cxl.0.rxq.1.cntxt_id: 10 
dev.cxl.0.rxq.1.abs_id: 10 dev.cxl.0.rxq.1.dmalen: 65536 dev.cxl.0.rxq.1.ba: 358170624 dev.cxl.0.rxq.1.fl.cluster_fast_recycled: 951 dev.cxl.0.rxq.1.fl.cluster_recycled: 2598029 dev.cxl.0.rxq.1.fl.cluster_allocated: 200964 dev.cxl.0.rxq.1.fl.pidx: 864 dev.cxl.0.rxq.1.fl.rx_offset: 1536 dev.cxl.0.rxq.1.fl.cidx: 879 dev.cxl.0.rxq.1.fl.packing: 1 dev.cxl.0.rxq.1.fl.padding: 1 dev.cxl.0.rxq.1.fl.cntxt_id: 21 dev.cxl.0.rxq.1.fl.dmalen: 8192 dev.cxl.0.rxq.1.fl.ba: 358236160 dev.cxl.0.rxq.0.vxlan_rxcsum: 0 dev.cxl.0.rxq.0.vlan_extraction: 0 dev.cxl.0.rxq.0.rxcsum: 6572711 dev.cxl.0.rxq.0.lro_flushed: 0 dev.cxl.0.rxq.0.lro_queued: 0 dev.cxl.0.rxq.0.cidx: 26 dev.cxl.0.rxq.0.cntxt_id: 9 dev.cxl.0.rxq.0.abs_id: 9 dev.cxl.0.rxq.0.dmalen: 65536 dev.cxl.0.rxq.0.ba: 358096896 dev.cxl.0.rxq.0.fl.cluster_fast_recycled: 2183 dev.cxl.0.rxq.0.fl.cluster_recycled: 2054949 dev.cxl.0.rxq.0.fl.cluster_allocated: 154708 dev.cxl.0.rxq.0.fl.pidx: 8 dev.cxl.0.rxq.0.fl.rx_offset: 1536 dev.cxl.0.rxq.0.fl.cidx: 21 dev.cxl.0.rxq.0.fl.packing: 1 dev.cxl.0.rxq.0.fl.padding: 1 dev.cxl.0.rxq.0.fl.cntxt_id: 20 dev.cxl.0.rxq.0.fl.dmalen: 8192 dev.cxl.0.rxq.0.fl.ba: 358162432 dev.cxl.0.%parent: t5nex0 dev.cxl.0.%pnpinfo: dev.cxl.0.%location: port=0 dev.cxl.0.%driver: cxl dev.cxl.0.%desc: port 0 hw.cxgbe.tx_coalesce_gap: 5 hw.cxgbe.tx_coalesce_pkts: 32 hw.cxgbe.tx_coalesce: 1 hw.cxgbe.defrags: 0 hw.cxgbe.pullups: 11 hw.cxgbe.lro_mbufs: 0 hw.cxgbe.lro_entries: 8 hw.cxgbe.tscale: 1 hw.cxgbe.safest_rx_cluster: 4096 hw.cxgbe.largest_rx_cluster: 16384 hw.cxgbe.fl_pack: -1 hw.cxgbe.buffer_packing: -1 hw.cxgbe.cong_drop: 0 hw.cxgbe.spg_len: 64 hw.cxgbe.fl_pad: -1 hw.cxgbe.fl_pktshift: 0 hw.cxgbe.nm_txcsum: 0 hw.cxgbe.nm_split_rss: 0 hw.cxgbe.lazy_tx_credit_flush: 1 hw.cxgbe.starve_fl: 0 hw.cxgbe.nm_cong_drop: 1 hw.cxgbe.nm_holdoff_tmr_idx: 2 hw.cxgbe.nm_rx_nframes: 64 hw.cxgbe.nm_rx_ndesc: 256 hw.cxgbe.nm_black_hole: 0 hw.cxgbe.tls.combo_wrs: 0 hw.cxgbe.tls.inline_keys: 0 hw.cxgbe.kern_tls: 0 hw.cxgbe.cop_managed_offloading: 0 hw.cxgbe.drop_pkts_with_l4_errors: 0 hw.cxgbe.drop_pkts_with_l3_errors: 1 hw.cxgbe.drop_pkts_with_l2_errors: 1 hw.cxgbe.drop_ip_fragments: 0 hw.cxgbe.attack_filter: 1 hw.cxgbe.tx_vm_wr: 0 hw.cxgbe.reset_on_fatal_err: 0 hw.cxgbe.panic_on_fatal_err: 0 hw.cxgbe.pcie_relaxed_ordering: 0 hw.cxgbe.num_vis: 1 hw.cxgbe.fcoecaps_allowed: 0 hw.cxgbe.iscsicaps_allowed: 0 hw.cxgbe.cryptocaps_allowed: -1 hw.cxgbe.rdmacaps_allowed: 0 hw.cxgbe.toecaps_allowed: 0 hw.cxgbe.niccaps_allowed: 33 hw.cxgbe.switchcaps_allowed: 3 hw.cxgbe.linkcaps_allowed: 0 hw.cxgbe.nbmcaps_allowed: 0 hw.cxgbe.fw_install: 1 hw.cxgbe.autoneg: -1 hw.cxgbe.force_fec: -1 hw.cxgbe.fec: -1 hw.cxgbe.pause_settings: 0 hw.cxgbe.config_file: default hw.cxgbe.interrupt_types: 7 hw.cxgbe.qsize_rxq: 1024 hw.cxgbe.qsize_txq: 1024 hw.cxgbe.holdoff_pktc_idx: -1 hw.cxgbe.holdoff_timer_idx: 1 hw.cxgbe.nnmrxq_vi: 2 hw.cxgbe.nnmtxq_vi: 2 hw.cxgbe.nnmrxq: 8 hw.cxgbe.nnmtxq: 8 hw.cxgbe.native_netmap: 2 hw.cxgbe.holdoff_pktc_idx_ofld: -1 hw.cxgbe.holdoff_timer_idx_ofld: 1 hw.cxgbe.nofldrxq_vi: 1 hw.cxgbe.nofldtxq_vi: 1 hw.cxgbe.nofldrxq: 2 hw.cxgbe.nofldtxq: 8 hw.cxgbe.rsrv_noflowq: 0 hw.cxgbe.nrxq_vi: 1 hw.cxgbe.ntxq_vi: 1 hw.cxgbe.nrxq: 4 hw.cxgbe.ntxq: 4 hw.cxgbe.toe.tls_rx_timeout: 5 hw.cxgbe.toe.rexmt_backoff.15: -1 hw.cxgbe.toe.rexmt_backoff.14: -1 hw.cxgbe.toe.rexmt_backoff.13: -1 hw.cxgbe.toe.rexmt_backoff.12: -1 hw.cxgbe.toe.rexmt_backoff.11: -1 hw.cxgbe.toe.rexmt_backoff.10: -1 hw.cxgbe.toe.rexmt_backoff.9: -1 hw.cxgbe.toe.rexmt_backoff.8: -1 
hw.cxgbe.toe.rexmt_backoff.7: -1 hw.cxgbe.toe.rexmt_backoff.6: -1 hw.cxgbe.toe.rexmt_backoff.5: -1 hw.cxgbe.toe.rexmt_backoff.4: -1 hw.cxgbe.toe.rexmt_backoff.3: -1 hw.cxgbe.toe.rexmt_backoff.2: -1 hw.cxgbe.toe.rexmt_backoff.1: -1 hw.cxgbe.toe.rexmt_backoff.0: -1 hw.cxgbe.toe.rexmt_count: 0 hw.cxgbe.toe.rexmt_max: 0 hw.cxgbe.toe.rexmt_min: 0 hw.cxgbe.toe.keepalive_count: 0 hw.cxgbe.toe.keepalive_interval: 0 hw.cxgbe.toe.keepalive_idle: 0 hw.cxgbe.clip_db_auto: 1 From nobody Fri Oct 21 18:13:14 2022 X-Original-To: freebsd-performance@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 4MvCJJ5jxpz4gP8Y for ; Fri, 21 Oct 2022 18:13:20 +0000 (UTC) (envelope-from nparhar@gmail.com) Received: from mail-oi1-x233.google.com (mail-oi1-x233.google.com [IPv6:2607:f8b0:4864:20::233]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256 client-signature RSA-PSS (2048 bits) client-digest SHA256) (Client CN "smtp.gmail.com", Issuer "GTS CA 1D4" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4MvCJH6zr2z3F5Y for ; Fri, 21 Oct 2022 18:13:19 +0000 (UTC) (envelope-from nparhar@gmail.com) Received: by mail-oi1-x233.google.com with SMTP id w196so4098303oiw.8 for ; Fri, 21 Oct 2022 11:13:19 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=content-transfer-encoding:in-reply-to:from:references:to :content-language:subject:user-agent:mime-version:date:message-id :from:to:cc:subject:date:message-id:reply-to; bh=kjCKIb0CxugqMgFTWxs42He/05mt+tA1fJ4gaFLMFrU=; b=TFHStUhmNQA/tCHbAfwhRMhPV+iOfnINnp4mDg8t6BBiXwJiB04ggEXGZKr0cM9WNr VGKqM3u9MqYI7UmN4ykyfhFVlqAd7zsaXpVYIwmpU4OjvZFSfOW6gqhAR/ApIPp09Dv5 4kx7k6kRqri3rgRkQjMb+uvDa3qwmpTrBlAgUA6S7pR/zxijbjxP6igX1fiXprcc0n9e ev/SFy5zkhdxY3deHdx/4OcF/tqf097LRhAKJNP4FF2NqJRGKK3CxlJGlUxDjqWyvdNG LbqpSDi0su5LU1axudI6JnCuHppMc3XbG7aqElRVCHCr8RZrP4NVBldeOM6NYv1s3FuK nTOA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:in-reply-to:from:references:to :content-language:subject:user-agent:mime-version:date:message-id :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=kjCKIb0CxugqMgFTWxs42He/05mt+tA1fJ4gaFLMFrU=; b=rUlOqOQ4twPEkIcm0mpqC396Yj7sx5ZKDwJZAzGP9qSbzomveh+7BcRm6lWqiyLCc9 y7fU3+tWr/QrZYsad0ZCu0uEtGzJIRRjCS2SIoVevjztLXA4hPuX/vN6/ivf640DacEu ImWvy+gqfEkK7k2BUoszcHYYTN+T8TRQ6Mm1EY8JtAykwuBze/suRD+63V2YoYEQjR+w dpokcZkfSAlePDSBQjyZ7NhRBbxIdQNbD9ky50mEdYAlNfyPdK2rPeJwuC8wwMPWAZB1 gzWcihItX7SQ8XyWZN4LRky40pjoAEuG6Nh7ZIrq3SaWJb8Taj2P9+PfJ1BaEtVHk7L9 UJDA== X-Gm-Message-State: ACrzQf3X+vkitqLS/uElnLCCzfOID/qOiwpm/dlNW1HPGtS/5ozPLYTm 2O/qhE83ggZqImjFSEbsN94= X-Google-Smtp-Source: AMsMyM67g6XCmXHemcEjxceDvY8uc+2CUv5LLGRzhy8IjRvYUxbnW4lLYYwsxgDDHPlwOfUbtxYZAg== X-Received: by 2002:aca:b05:0:b0:355:70c3:5c4a with SMTP id 5-20020aca0b05000000b0035570c35c4amr4312666oil.142.1666375998295; Fri, 21 Oct 2022 11:13:18 -0700 (PDT) Received: from [10.192.161.10] (stargate.chelsio.com. 
[12.32.117.8]) by smtp.googlemail.com with ESMTPSA id z6-20020a056808028600b003546fada8f6sm1377737oic.12.2022.10.21.11.13.17 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Fri, 21 Oct 2022 11:13:17 -0700 (PDT) Message-ID: Date: Fri, 21 Oct 2022 11:13:14 -0700 List-Id: Performance/tuning List-Archive: https://lists.freebsd.org/archives/freebsd-performance List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-performance@freebsd.org X-BeenThere: freebsd-performance@freebsd.org MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:102.0) Gecko/20100101 Thunderbird/102.3.2 Subject: Re: Chelsio Forwarding performance and RELENG_13 vs RELENG_12 Content-Language: en-US To: mike tancsa , Freebsd performance References: <7b86e3fe-62e4-7b3e-f4bf-30e4894db9db@sentex.net> <92cdf4b8-2209-ec44-8151-a59b9e8f1504@gmail.com> From: Navdeep Parhar In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Rspamd-Queue-Id: 4MvCJH6zr2z3F5Y X-Spamd-Bar: --- Authentication-Results: mx1.freebsd.org; dkim=pass header.d=gmail.com header.s=20210112 header.b=TFHStUhm; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (mx1.freebsd.org: domain of nparhar@gmail.com designates 2607:f8b0:4864:20::233 as permitted sender) smtp.mailfrom=nparhar@gmail.com X-Spamd-Result: default: False [-3.90 / 15.00]; NEURAL_HAM_LONG(-1.00)[-1.000]; NEURAL_HAM_SHORT(-1.00)[-1.000]; NEURAL_HAM_MEDIUM(-1.00)[-0.999]; DMARC_POLICY_ALLOW(-0.50)[gmail.com,none]; R_SPF_ALLOW(-0.20)[+ip6:2607:f8b0:4000::/36:c]; R_DKIM_ALLOW(-0.20)[gmail.com:s=20210112]; MIME_BASE64_TEXT(0.10)[]; MIME_GOOD(-0.10)[text/plain]; ARC_NA(0.00)[]; RCVD_VIA_SMTP_AUTH(0.00)[]; FROM_HAS_DN(0.00)[]; PREVIOUSLY_DELIVERED(0.00)[freebsd-performance@freebsd.org]; DWL_DNSWL_NONE(0.00)[gmail.com:dkim]; TO_MATCH_ENVRCPT_SOME(0.00)[]; MID_RHS_MATCH_FROM(0.00)[]; ASN(0.00)[asn:15169, ipnet:2607:f8b0::/32, country:US]; RCVD_IN_DNSWL_NONE(0.00)[2607:f8b0:4864:20::233:from]; TO_DN_ALL(0.00)[]; RCVD_COUNT_THREE(0.00)[3]; FREEMAIL_FROM(0.00)[gmail.com]; RCVD_TLS_LAST(0.00)[]; DKIM_TRACE(0.00)[gmail.com:+]; FROM_EQ_ENVFROM(0.00)[]; RCPT_COUNT_TWO(0.00)[2]; FREEMAIL_ENVFROM(0.00)[gmail.com]; MIME_TRACE(0.00)[0:+]; MLMMJ_DEST(0.00)[freebsd-performance@freebsd.org] X-ThisMailContainsUnwantedMimeParts: N
On 10/21/22 10:57 AM, mike tancsa wrote:
> On 10/21/2022 1:31 PM, Navdeep Parhar wrote:
>> On 10/18/22 12:16 PM, mike tancsa wrote:
>>> I updated a RELENG_12 router along with the hardware to RELENG_13
>>> (oct 14th kernel) and was surprised to see an increase in
>>> dev.cxl.0.stats.rx_ovflow0 at a somewhat faster rate than I was
>>> seeing on the older slightly slower hardware under about the same
>>> load. (Xeon(R) E-2226G CPU @ 3.40GHz) vs a 4 core Xeon same freq,
>>> same memory speed. About 150Kpps in and out and a 1Gb/s throughput
>>>
>>> loader.conf is the same
>>>
>>>
>>> hw.cxgbe.toecaps_allowed="0"
>>> hw.cxgbe.rdmacaps_allowed="0"
>>> hw.cxgbe.iscsicaps_allowed="0"
>>> hw.cxgbe.fcoecaps_allowed="0"
>>> hw.cxgbe.pause_settings="0"
>>> hw.cxgbe.attack_filter="1"
>>> hw.cxgbe.drop_pkts_with_l3_errors="1"
>>>
>>> As there is a large routing table, I do have
>>>
>>> [fib_algo] inet.0 (radix4_lockless#46) rebuild_fd_flm: switching algo
>>> to radix4
>>> [fib_algo] inet6.0 (radix6_lockless#58) rebuild_fd_flm: switching
>>> algo to radix6
>>>
>>> kicking in.
>>>
>>> and sysctl.conf
>>>
>>> net.route.multipath=0
>>>
>>> net.inet.ip.redirect=0
>>> net.inet6.ip6.redirect=0
>>> kern.ipc.maxsockbuf=16777216
>>> net.inet.tcp.blackhole=1
>>>
>>> Are there any other tweaks that can be done in order to better
>>> forwarding performance ? I do see at bootup time
>>>
>>> cxl0: nrxq (6), hw RSS table size (128); expect uneven traffic
>>> distribution.
>>> cxl1: nrxq (6), hw RSS table size (128); expect uneven traffic
>>> distribution.
>>> cxl3: nrxq (6), hw RSS table size (128); expect uneven traffic
>>> distribution.
>>>
>>> The cpu is 6 core. No HT enabled
>>
>> The old system was 4-core so it must have used 4 queues.  Can you
>> please try that on the new system and see how it does?
>>
>> hw.cxgbe.ntxq=4
>> hw.cxgbe.nrxq=4
>>
> Thanks Navdeep!
>
> Unfortunately, still the odd dropped packet :(

Can you try increasing the size of the queues?

hw.cxgbe.qsize_txq=2048
hw.cxgbe.qsize_rxq=2048

The stats show that you are using MTU 1500.  If you were using MTU 9000
I'd also have suggested setting largest_rx_cluster to 4K.

hw.cxgbe.largest_rx_cluster=4096

Regards,
Navdeep
From nobody Fri Oct 21 18:15:06 2022 X-Original-To: freebsd-performance@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 4MvCLN1cppz4gNq4 for ; Fri, 21 Oct 2022 18:15:08 +0000 (UTC) (envelope-from mike@sentex.net) Received: from smarthost1.sentex.ca (smarthost1.sentex.ca [IPv6:2607:f3e0:0:1::12]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "smarthost1.sentex.ca", Issuer "R3" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4MvCLL75n2z3Fd0 for ; Fri, 21 Oct 2022 18:15:06 +0000 (UTC) (envelope-from mike@sentex.net) Received: from pyroxene2a.sentex.ca (pyroxene19.sentex.ca [199.212.134.19]) by smarthost1.sentex.ca (8.16.1/8.16.1) with ESMTPS id 29LIF6o3015705 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=FAIL); Fri, 21 Oct 2022 14:15:06 -0400 (EDT) (envelope-from mike@sentex.net) Received: from [IPV6:2607:f3e0:0:4:f808:cefc:42f3:2221] ([IPv6:2607:f3e0:0:4:f808:cefc:42f3:2221]) by pyroxene2a.sentex.ca (8.16.1/8.15.2) with ESMTPS id 29LIF5K7017931 (version=TLSv1.3 cipher=TLS_AES_128_GCM_SHA256 bits=128 verify=NO); Fri, 21 Oct 2022 14:15:06 -0400 (EDT) (envelope-from mike@sentex.net) Message-ID: Date: Fri, 21 Oct 2022 14:15:06 -0400 List-Id: Performance/tuning List-Archive: https://lists.freebsd.org/archives/freebsd-performance List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-performance@freebsd.org X-BeenThere: freebsd-performance@freebsd.org MIME-Version: 1.0 User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; 
x64; rv:102.0) Gecko/20100101 Thunderbird/102.3.3 Subject: Re: Chelsio Forwarding performance and RELENG_13 vs RELENG_12 Content-Language: en-US To: Navdeep Parhar , Freebsd performance References: <7b86e3fe-62e4-7b3e-f4bf-30e4894db9db@sentex.net> <92cdf4b8-2209-ec44-8151-a59b9e8f1504@gmail.com> From: mike tancsa In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Scanned-By: MIMEDefang 2.84 on 64.7.153.18 X-Rspamd-Queue-Id: 4MvCLL75n2z3Fd0 X-Spamd-Bar: --- Authentication-Results: mx1.freebsd.org; dkim=none; dmarc=none; spf=pass (mx1.freebsd.org: domain of mike@sentex.net designates 2607:f3e0:0:1::12 as permitted sender) smtp.mailfrom=mike@sentex.net X-Spamd-Result: default: False [-3.40 / 15.00]; NEURAL_HAM_LONG(-1.00)[-1.000]; NEURAL_HAM_MEDIUM(-1.00)[-1.000]; NEURAL_HAM_SHORT(-1.00)[-0.999]; R_SPF_ALLOW(-0.20)[+ip6:2607:f3e0::/32]; RCVD_IN_DNSWL_LOW(-0.10)[199.212.134.19:received]; MIME_GOOD(-0.10)[text/plain]; RCVD_TLS_ALL(0.00)[]; FREEMAIL_TO(0.00)[gmail.com,freebsd.org]; ASN(0.00)[asn:11647, ipnet:2607:f3e0::/32, country:CA]; FROM_EQ_ENVFROM(0.00)[]; MIME_TRACE(0.00)[0:+]; R_DKIM_NA(0.00)[]; MLMMJ_DEST(0.00)[freebsd-performance@freebsd.org]; TO_DN_ALL(0.00)[]; MID_RHS_MATCH_FROM(0.00)[]; FROM_HAS_DN(0.00)[]; FREEFALL_USER(0.00)[mike]; RCPT_COUNT_TWO(0.00)[2]; DMARC_NA(0.00)[sentex.net]; TO_MATCH_ENVRCPT_SOME(0.00)[]; RCVD_COUNT_THREE(0.00)[3]; ARC_NA(0.00)[] X-ThisMailContainsUnwantedMimeParts: N On 10/21/2022 2:13 PM, Navdeep Parhar wrote: > >>> hw.cxgbe.ntxq=4 >>> hw.cxgbe.nrxq=4 >>> >> Thanks Navdeep! >> >> Unfortunately, still the odd dropped packet :( > > Can you try increasing the size of the queues? > > hw.cxgbe.qsize_txq=2048 > hw.cxgbe.qsize_rxq=2048 > > The stats show that you are using MTU 1500.  If you were using MTU > 9000 I'd also have suggested setting largest_rx_cluster to 4K. > > hw.cxgbe.largest_rx_cluster=4096 > Thanks again, just 1500 MTU.  Shall I keep nt and nrxq at 4 still ?     
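For reference, the knobs suggested so far in this thread are cxgbe(4) loader tunables, so a minimal sketch of how they would be collected in /boot/loader.conf looks like the lines below. The values are exactly the ones proposed above; whether the queue count stays at 4 is the open question being asked here.

# /boot/loader.conf additions discussed in this thread
hw.cxgbe.ntxq=4
hw.cxgbe.nrxq=4
hw.cxgbe.qsize_txq=2048
hw.cxgbe.qsize_rxq=2048

They take effect at the next boot, and the values actually in use show up afterwards as dev.cxl.0.ntxq, dev.cxl.0.nrxq, dev.cxl.0.qsize_txq and dev.cxl.0.qsize_rxq in the sysctl output quoted earlier.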
---Mike From nobody Fri Oct 21 18:30:28 2022 X-Original-To: freebsd-performance@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 4MvCh92DdWz4gR0T for ; Fri, 21 Oct 2022 18:30:33 +0000 (UTC) (envelope-from nparhar@gmail.com) Received: from mail-pg1-x530.google.com (mail-pg1-x530.google.com [IPv6:2607:f8b0:4864:20::530]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256 client-signature RSA-PSS (2048 bits) client-digest SHA256) (Client CN "smtp.gmail.com", Issuer "GTS CA 1D4" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4MvCh80kdcz3HM8 for ; Fri, 21 Oct 2022 18:30:32 +0000 (UTC) (envelope-from nparhar@gmail.com) Received: by mail-pg1-x530.google.com with SMTP id 78so3230939pgb.13 for ; Fri, 21 Oct 2022 11:30:32 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=content-transfer-encoding:in-reply-to:from:content-language :references:to:subject:user-agent:mime-version:date:message-id:from :to:cc:subject:date:message-id:reply-to; bh=9aI996t9jXjg8ISkYwoRrlxJi6CJfG1GRhVK1yo/mG0=; b=pO6FE1/XU9sjmw0/55g+S/f52Q65lQSF962c10T7/H34eI174/8JBNp71J7KDKxDbO TdGZRiOgIL8G4evSDdbicV1irsfdX49H4osqdtkCiPM93fkAXxDdcLSl/NCq/2pvQxlO qWIsrwGG1jy+BpInCd7QpJsNdFOprFaCwcKdlq/mkGTrs0VvRUz5QpG3VpNoNRnbUopT FJ682s2U2PJuUqGT4ZDbJmkiUTSBbTGrXJB8rU/hlqKbR9Qedgq+fBaebkneL5HhzgoB 4soeloofkkDsgSTDf8iFotOCXlyv3bgGG7P4k/WctO/Vtk83yzwWeF9DFvoDKhQy3lWL F1TQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:in-reply-to:from:content-language :references:to:subject:user-agent:mime-version:date:message-id :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=9aI996t9jXjg8ISkYwoRrlxJi6CJfG1GRhVK1yo/mG0=; b=q9jCVcE2CUQGwdvIs4mzW4bmA/Zep87HqvUSJoELb8qyaR0ZlnsZdbsN2FVK8e2yd/ OZWPpG6dxWn+DaG+Ei5lc4zNGG6oQTiEdQZjQES7LodAjK8LeQSOqeJXmUkAX0JFo9R9 DFbDBVtji4QWIROoxvABv02JPe2locgQpk/xB01JIjsfnIxD6loF+MlLqy3EmV2eh+0s gGd8OtBrLehdGyQQdDYEhRoyGPu5E+yZOIMhqWCZO97IBi3XTFJ2uBeG0aDhm+cYLNtr Mg7yopSSQRjk/HR/4fP6WkVPw0pFvosxQ7rKMG/bo7KChrN7SaBY42wcQ/vb0EnwlWb5 5z0g== X-Gm-Message-State: ACrzQf1n0jn7fi7SHsvd5TiJ0j3jXSXjvf19A834dvvPZVa4LZ7Ge/Ru HwXK4KWxjIbnvI4r4JmNzMhlR8wi+OM= X-Google-Smtp-Source: AMsMyM4VFRO9kFfTEqz6yy+7NE8B0cRMMSK8NsJ3BzrMBu7t4Mf6q3rXaxH2PKNozkvbQDCOWMOYkg== X-Received: by 2002:a63:1b58:0:b0:45f:e7ba:a223 with SMTP id b24-20020a631b58000000b0045fe7baa223mr16593197pgm.548.1666377030506; Fri, 21 Oct 2022 11:30:30 -0700 (PDT) Received: from [10.192.161.10] (stargate.chelsio.com. 
[12.32.117.8]) by smtp.googlemail.com with ESMTPSA id j12-20020a17090aeb0c00b00205ec23b392sm1937756pjz.12.2022.10.21.11.30.29 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Fri, 21 Oct 2022 11:30:30 -0700 (PDT) Message-ID: <8166abfe-a796-2cf0-ade2-de08df8eecd2@gmail.com> Date: Fri, 21 Oct 2022 11:30:28 -0700 List-Id: Performance/tuning List-Archive: https://lists.freebsd.org/archives/freebsd-performance List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-performance@freebsd.org X-BeenThere: freebsd-performance@freebsd.org MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:102.0) Gecko/20100101 Thunderbird/102.4.0 Subject: Re: Chelsio Forwarding performance and RELENG_13 vs RELENG_12 To: mike tancsa , Freebsd performance References: <7b86e3fe-62e4-7b3e-f4bf-30e4894db9db@sentex.net> <92cdf4b8-2209-ec44-8151-a59b9e8f1504@gmail.com> Content-Language: en-US From: Navdeep Parhar In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Rspamd-Queue-Id: 4MvCh80kdcz3HM8 X-Spamd-Bar: --- Authentication-Results: mx1.freebsd.org; dkim=pass header.d=gmail.com header.s=20210112 header.b="pO6FE1/X"; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (mx1.freebsd.org: domain of nparhar@gmail.com designates 2607:f8b0:4864:20::530 as permitted sender) smtp.mailfrom=nparhar@gmail.com X-Spamd-Result: default: False [-3.90 / 15.00]; NEURAL_HAM_LONG(-1.00)[-1.000]; NEURAL_HAM_SHORT(-1.00)[-1.000]; NEURAL_HAM_MEDIUM(-1.00)[-0.999]; DMARC_POLICY_ALLOW(-0.50)[gmail.com,none]; R_DKIM_ALLOW(-0.20)[gmail.com:s=20210112]; R_SPF_ALLOW(-0.20)[+ip6:2607:f8b0:4000::/36]; MIME_BASE64_TEXT(0.10)[]; MIME_GOOD(-0.10)[text/plain]; ARC_NA(0.00)[]; RCVD_VIA_SMTP_AUTH(0.00)[]; FROM_HAS_DN(0.00)[]; PREVIOUSLY_DELIVERED(0.00)[freebsd-performance@freebsd.org]; DWL_DNSWL_NONE(0.00)[gmail.com:dkim]; TO_MATCH_ENVRCPT_SOME(0.00)[]; MID_RHS_MATCH_FROM(0.00)[]; ASN(0.00)[asn:15169, ipnet:2607:f8b0::/32, country:US]; RCVD_IN_DNSWL_NONE(0.00)[2607:f8b0:4864:20::530:from]; TO_DN_ALL(0.00)[]; RCVD_COUNT_THREE(0.00)[3]; FREEMAIL_FROM(0.00)[gmail.com]; RCVD_TLS_LAST(0.00)[]; DKIM_TRACE(0.00)[gmail.com:+]; FROM_EQ_ENVFROM(0.00)[]; RCPT_COUNT_TWO(0.00)[2]; FREEMAIL_ENVFROM(0.00)[gmail.com]; MIME_TRACE(0.00)[0:+]; MLMMJ_DEST(0.00)[freebsd-performance@freebsd.org] X-ThisMailContainsUnwantedMimeParts: N
On 10/21/22 11:15 AM, mike tancsa wrote:
> On 10/21/2022 2:13 PM, Navdeep Parhar wrote:
>>
>>>> hw.cxgbe.ntxq=4
>>>> hw.cxgbe.nrxq=4
>>>>
>>> Thanks Navdeep!
>>>
>>> Unfortunately, still the odd dropped packet :(
>>
>> Can you try increasing the size of the queues?
>>
>> hw.cxgbe.qsize_txq=2048
>> hw.cxgbe.qsize_rxq=2048
>>
>> The stats show that you are using MTU 1500.  If you were using MTU
>> 9000 I'd also have suggested setting largest_rx_cluster to 4K.
>>
>> hw.cxgbe.largest_rx_cluster=4096
>>
> Thanks again, just 1500 MTU.  Shall I keep nt and nrxq at 4 still ?

Yes, I think 4 queues are enough for 10G.

Regards,
Navdeep
From nobody Fri Oct 21 18:45:37 2022 X-Original-To: freebsd-performance@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 4MvD1b2Nbnz4gT3c for ; Fri, 21 Oct 2022
18:45:39 +0000 (UTC) (envelope-from mike@sentex.net) Received: from smarthost1.sentex.ca (smarthost1.sentex.ca [IPv6:2607:f3e0:0:1::12]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "smarthost1.sentex.ca", Issuer "R3" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4MvD1Z42Jzz3LhQ for ; Fri, 21 Oct 2022 18:45:38 +0000 (UTC) (envelope-from mike@sentex.net) Received: from pyroxene2a.sentex.ca (pyroxene19.sentex.ca [199.212.134.19]) by smarthost1.sentex.ca (8.16.1/8.16.1) with ESMTPS id 29LIjaqH048702 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=FAIL); Fri, 21 Oct 2022 14:45:37 -0400 (EDT) (envelope-from mike@sentex.net) Received: from [IPV6:2607:f3e0:0:4:f808:cefc:42f3:2221] ([IPv6:2607:f3e0:0:4:f808:cefc:42f3:2221]) by pyroxene2a.sentex.ca (8.16.1/8.15.2) with ESMTPS id 29LIja8p028076 (version=TLSv1.3 cipher=TLS_AES_128_GCM_SHA256 bits=128 verify=NO); Fri, 21 Oct 2022 14:45:36 -0400 (EDT) (envelope-from mike@sentex.net) Message-ID: <39ca9375-e742-618e-5020-dda5fa24ac0a@sentex.net> Date: Fri, 21 Oct 2022 14:45:37 -0400 List-Id: Performance/tuning List-Archive: https://lists.freebsd.org/archives/freebsd-performance List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-performance@freebsd.org X-BeenThere: freebsd-performance@freebsd.org MIME-Version: 1.0 User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Thunderbird/102.3.3 Subject: Re: Chelsio Forwarding performance and RELENG_13 vs RELENG_12 Content-Language: en-US To: Navdeep Parhar , Freebsd performance References: <7b86e3fe-62e4-7b3e-f4bf-30e4894db9db@sentex.net> <92cdf4b8-2209-ec44-8151-a59b9e8f1504@gmail.com> <8166abfe-a796-2cf0-ade2-de08df8eecd2@gmail.com> From: mike tancsa In-Reply-To: <8166abfe-a796-2cf0-ade2-de08df8eecd2@gmail.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Scanned-By: MIMEDefang 2.84 on 64.7.153.18 X-Rspamd-Queue-Id: 4MvD1Z42Jzz3LhQ X-Spamd-Bar: --- Authentication-Results: mx1.freebsd.org; dkim=none; dmarc=none; spf=pass (mx1.freebsd.org: domain of mike@sentex.net designates 2607:f3e0:0:1::12 as permitted sender) smtp.mailfrom=mike@sentex.net X-Spamd-Result: default: False [-3.40 / 15.00]; NEURAL_HAM_LONG(-1.00)[-1.000]; NEURAL_HAM_MEDIUM(-1.00)[-1.000]; NEURAL_HAM_SHORT(-1.00)[-1.000]; R_SPF_ALLOW(-0.20)[+ip6:2607:f3e0::/32]; RCVD_IN_DNSWL_LOW(-0.10)[199.212.134.19:received]; MIME_GOOD(-0.10)[text/plain]; RCVD_TLS_ALL(0.00)[]; FREEMAIL_TO(0.00)[gmail.com,freebsd.org]; ASN(0.00)[asn:11647, ipnet:2607:f3e0::/32, country:CA]; FROM_EQ_ENVFROM(0.00)[]; MIME_TRACE(0.00)[0:+]; R_DKIM_NA(0.00)[]; MLMMJ_DEST(0.00)[freebsd-performance@freebsd.org]; TO_DN_ALL(0.00)[]; MID_RHS_MATCH_FROM(0.00)[]; FROM_HAS_DN(0.00)[]; FREEFALL_USER(0.00)[mike]; RCPT_COUNT_TWO(0.00)[2]; DMARC_NA(0.00)[sentex.net]; TO_MATCH_ENVRCPT_SOME(0.00)[]; RCVD_COUNT_THREE(0.00)[3]; ARC_NA(0.00)[] X-ThisMailContainsUnwantedMimeParts: N On 10/21/2022 2:30 PM, Navdeep Parhar wrote: > On 10/21/22 11:15 AM, mike tancsa wrote: >> On 10/21/2022 2:13 PM, Navdeep Parhar wrote: >>> >>>>> hw.cxgbe.ntxq=4 >>>>> hw.cxgbe.nrxq=4 >>>>> >>>> Thanks Navdeep! >>>> >>>> Unfortunately, still the odd dropped packet :( >>> >>> Can you try increasing the size of the queues? >>> >>> hw.cxgbe.qsize_txq=2048 >>> hw.cxgbe.qsize_rxq=2048 >>> >>> The stats show that you are using MTU 1500.  If you were using MTU >>> 9000 I'd also have suggested setting largest_rx_cluster to 4K. 
>>> >>> hw.cxgbe.largest_rx_cluster=4096 >>> >> Thanks again, just 1500 MTU.  Shall I keep nt and nrxq at 4 still ? > > Yes, I think 4 queues are enough for 10G. > Sadly, no luck. Still about the same rate of overflows :( dev.cxl.0.stats.rx_trunc3: 0 dev.cxl.0.stats.rx_trunc2: 0 dev.cxl.0.stats.rx_trunc1: 0 dev.cxl.0.stats.rx_trunc0: 5 dev.cxl.0.stats.rx_ovflow3: 0 dev.cxl.0.stats.rx_ovflow2: 0 dev.cxl.0.stats.rx_ovflow1: 0 dev.cxl.0.stats.rx_ovflow0: 61 dev.cxl.0.stats.rx_ppp7: 0 dev.cxl.0.stats.rx_ppp6: 0 dev.cxl.0.stats.rx_ppp5: 0 dev.cxl.0.stats.rx_ppp4: 0 dev.cxl.0.stats.rx_ppp3: 0 dev.cxl.0.stats.rx_ppp2: 0 dev.cxl.0.stats.rx_ppp1: 0 dev.cxl.0.stats.rx_ppp0: 0 dev.cxl.0.stats.rx_pause: 0 dev.cxl.0.stats.rx_frames_1519_max: 0 dev.cxl.0.stats.rx_frames_1024_1518: 25966401 dev.cxl.0.stats.rx_frames_512_1023: 1569927 dev.cxl.0.stats.rx_frames_256_511: 1856460 dev.cxl.0.stats.rx_frames_128_255: 2718106 dev.cxl.0.stats.rx_frames_65_127: 4548626 dev.cxl.0.stats.rx_frames_64: 3044300 dev.cxl.0.stats.rx_runt: 0 dev.cxl.0.stats.rx_symbol_err: 0 dev.cxl.0.stats.rx_len_err: 0 dev.cxl.0.stats.rx_fcs_err: 0 dev.cxl.0.stats.rx_jabber: 0 dev.cxl.0.stats.rx_too_long: 0 dev.cxl.0.stats.rx_ucast_frames: 39618919 dev.cxl.0.stats.rx_mcast_frames: 36716 dev.cxl.0.stats.rx_bcast_frames: 48219 dev.cxl.0.stats.rx_frames: 39703866 dev.cxl.0.stats.rx_octets: 39678756369 dev.cxl.0.stats.tx_ppp7: 0 From nobody Thu Nov 3 18:20:54 2022 X-Original-To: freebsd-performance@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 4N3BsL0qkRz4gM3q for ; Thu, 3 Nov 2022 18:21:10 +0000 (UTC) (envelope-from mike@sentex.net) Received: from smarthost1.sentex.ca (smarthost1.sentex.ca [IPv6:2607:f3e0:0:1::12]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "smarthost1.sentex.ca", Issuer "R3" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4N3BsK0fk1z44Cg for ; Thu, 3 Nov 2022 18:21:09 +0000 (UTC) (envelope-from mike@sentex.net) Received: from pyroxene2a.sentex.ca (pyroxene19.sentex.ca [199.212.134.19]) by smarthost1.sentex.ca (8.16.1/8.16.1) with ESMTPS id 2A3IKsPv037284 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=FAIL); Thu, 3 Nov 2022 14:20:54 -0400 (EDT) (envelope-from mike@sentex.net) Received: from [IPV6:2607:f3e0:0:4:8d08:ffbe:d530:da9d] ([IPv6:2607:f3e0:0:4:8d08:ffbe:d530:da9d]) by pyroxene2a.sentex.ca (8.16.1/8.15.2) with ESMTPS id 2A3IKrvj006162 (version=TLSv1.3 cipher=TLS_AES_128_GCM_SHA256 bits=128 verify=NO); Thu, 3 Nov 2022 14:20:53 -0400 (EDT) (envelope-from mike@sentex.net) Message-ID: <63424978-a10f-a88b-2b3e-eb80d0f29f51@sentex.net> Date: Thu, 3 Nov 2022 14:20:54 -0400 List-Id: Performance/tuning List-Archive: https://lists.freebsd.org/archives/freebsd-performance List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-performance@freebsd.org X-BeenThere: freebsd-performance@freebsd.org MIME-Version: 1.0 User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Thunderbird/102.4.1 Subject: Re: Chelsio Forwarding performance and RELENG_13 vs RELENG_12 (solved) Content-Language: en-US From: mike tancsa To: Navdeep Parhar , Freebsd performance References: <7b86e3fe-62e4-7b3e-f4bf-30e4894db9db@sentex.net> <92cdf4b8-2209-ec44-8151-a59b9e8f1504@gmail.com> <8166abfe-a796-2cf0-ade2-de08df8eecd2@gmail.com> <39ca9375-e742-618e-5020-dda5fa24ac0a@sentex.net> In-Reply-To: <39ca9375-e742-618e-5020-dda5fa24ac0a@sentex.net> 
Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Scanned-By: MIMEDefang 2.84 on 64.7.153.18 X-Rspamd-Queue-Id: 4N3BsK0fk1z44Cg X-Spamd-Bar: --- Authentication-Results: mx1.freebsd.org; dkim=none; dmarc=none; spf=pass (mx1.freebsd.org: domain of mike@sentex.net designates 2607:f3e0:0:1::12 as permitted sender) smtp.mailfrom=mike@sentex.net X-Spamd-Result: default: False [-3.40 / 15.00]; NEURAL_HAM_LONG(-1.00)[-1.000]; NEURAL_HAM_MEDIUM(-1.00)[-1.000]; NEURAL_HAM_SHORT(-1.00)[-1.000]; R_SPF_ALLOW(-0.20)[+ip6:2607:f3e0::/32]; RCVD_IN_DNSWL_LOW(-0.10)[199.212.134.19:received]; MIME_GOOD(-0.10)[text/plain]; RCVD_TLS_ALL(0.00)[]; FREEMAIL_TO(0.00)[gmail.com,freebsd.org]; ASN(0.00)[asn:11647, ipnet:2607:f3e0::/32, country:CA]; FROM_EQ_ENVFROM(0.00)[]; MIME_TRACE(0.00)[0:+]; R_DKIM_NA(0.00)[]; MLMMJ_DEST(0.00)[freebsd-performance@freebsd.org]; TO_DN_ALL(0.00)[]; MID_RHS_MATCH_FROM(0.00)[]; FROM_HAS_DN(0.00)[]; FREEFALL_USER(0.00)[mike]; RCPT_COUNT_TWO(0.00)[2]; DMARC_NA(0.00)[sentex.net]; TO_MATCH_ENVRCPT_SOME(0.00)[]; RCVD_COUNT_THREE(0.00)[3]; ARC_NA(0.00)[] X-ThisMailContainsUnwantedMimeParts: N On 10/21/2022 2:45 PM, mike tancsa wrote: > On 10/21/2022 2:30 PM, Navdeep Parhar wrote: >> On 10/21/22 11:15 AM, mike tancsa wrote: >>> On 10/21/2022 2:13 PM, Navdeep Parhar wrote: >>>> >>>>>> hw.cxgbe.ntxq=4 >>>>>> hw.cxgbe.nrxq=4 >>>>>> >>>>> Thanks Navdeep! >>>>> >>>>> Unfortunately, still the odd dropped packet :( >>>> >>>> Can you try increasing the size of the queues? >>>> >>>> hw.cxgbe.qsize_txq=2048 >>>> hw.cxgbe.qsize_rxq=2048 >>>> >>>> The stats show that you are using MTU 1500.  If you were using MTU >>>> 9000 I'd also have suggested setting largest_rx_cluster to 4K. >>>> >>>> hw.cxgbe.largest_rx_cluster=4096 >>>> >>> Thanks again, just 1500 MTU.  Shall I keep nt and nrxq at 4 still ? >> >> Yes, I think 4 queues are enough for 10G. >> > Sadly, no luck. 
Still about the same rate of overflows :( > > FYI, I worked around the issue by using two 520-CR NICs instead of the one 540-CR NIC and performance is solid again with no dropped packets     ---Mike From nobody Wed Nov 23 19:40:31 2022 X-Original-To: freebsd-performance@mlmmj.nyi.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1]) by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 4NHWh06xl4z4j81l for ; Wed, 23 Nov 2022 19:40:48 +0000 (UTC) (envelope-from mike@sentex.net) Received: from smarthost1.sentex.ca (smarthost1.sentex.ca [IPv6:2607:f3e0:0:1::12]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "smarthost1.sentex.ca", Issuer "R3" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 4NHWh00gZBz3pBR for ; Wed, 23 Nov 2022 19:40:48 +0000 (UTC) (envelope-from mike@sentex.net) Authentication-Results: mx1.freebsd.org; dkim=none; spf=pass (mx1.freebsd.org: domain of mike@sentex.net designates 2607:f3e0:0:1::12 as permitted sender) smtp.mailfrom=mike@sentex.net; dmarc=none Received: from pyroxene2a.sentex.ca (pyroxene19.sentex.ca [199.212.134.19]) by smarthost1.sentex.ca (8.16.1/8.16.1) with ESMTPS id 2ANJeVbe073228 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=FAIL); Wed, 23 Nov 2022 14:40:32 -0500 (EST) (envelope-from mike@sentex.net) Received: from [IPV6:2607:f3e0:0:4:c0b6:89d8:b838:d5de] ([IPv6:2607:f3e0:0:4:c0b6:89d8:b838:d5de]) by pyroxene2a.sentex.ca (8.16.1/8.15.2) with ESMTPS id 2ANJeVI6061279 (version=TLSv1.3 cipher=TLS_AES_128_GCM_SHA256 bits=128 verify=NO); Wed, 23 Nov 2022 14:40:31 -0500 (EST) (envelope-from mike@sentex.net) Message-ID: Date: Wed, 23 Nov 2022 14:40:31 -0500 List-Id: Performance/tuning List-Archive: https://lists.freebsd.org/archives/freebsd-performance List-Help: List-Post: List-Subscribe: List-Unsubscribe: Sender: owner-freebsd-performance@freebsd.org X-BeenThere: freebsd-performance@freebsd.org MIME-Version: 1.0 User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Thunderbird/102.5.0 Subject: Re: Chelsio Forwarding performance and RELENG_13 vs RELENG_12 (solved) Content-Language: en-US From: mike tancsa To: Navdeep Parhar , Freebsd performance References: <7b86e3fe-62e4-7b3e-f4bf-30e4894db9db@sentex.net> <92cdf4b8-2209-ec44-8151-a59b9e8f1504@gmail.com> <8166abfe-a796-2cf0-ade2-de08df8eecd2@gmail.com> <39ca9375-e742-618e-5020-dda5fa24ac0a@sentex.net> <63424978-a10f-a88b-2b3e-eb80d0f29f51@sentex.net> In-Reply-To: <63424978-a10f-a88b-2b3e-eb80d0f29f51@sentex.net> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Scanned-By: MIMEDefang 2.84 on 64.7.153.18 X-Spamd-Result: default: False [-3.35 / 15.00]; NEURAL_HAM_MEDIUM(-1.00)[-0.999]; NEURAL_HAM_SHORT(-0.98)[-0.981]; NEURAL_HAM_LONG(-0.97)[-0.970]; R_SPF_ALLOW(-0.20)[+ip6:2607:f3e0::/32]; RCVD_IN_DNSWL_LOW(-0.10)[199.212.134.19:received]; MIME_GOOD(-0.10)[text/plain]; RCVD_TLS_ALL(0.00)[]; FREEMAIL_TO(0.00)[gmail.com,freebsd.org]; ASN(0.00)[asn:11647, ipnet:2607:f3e0::/32, country:CA]; FROM_EQ_ENVFROM(0.00)[]; MIME_TRACE(0.00)[0:+]; R_DKIM_NA(0.00)[]; MLMMJ_DEST(0.00)[freebsd-performance@freebsd.org]; TO_DN_ALL(0.00)[]; MID_RHS_MATCH_FROM(0.00)[]; FROM_HAS_DN(0.00)[]; FREEFALL_USER(0.00)[mike]; RCPT_COUNT_TWO(0.00)[2]; DMARC_NA(0.00)[sentex.net]; TO_MATCH_ENVRCPT_SOME(0.00)[]; RCVD_COUNT_THREE(0.00)[3]; ARC_NA(0.00)[] X-Rspamd-Queue-Id: 4NHWh00gZBz3pBR X-Spamd-Bar: --- X-ThisMailContainsUnwantedMimeParts: N On 11/3/2022 2:20 PM, mike tancsa wrote: > 
Yes, I think 4 queues are enough for 10G. >>> >> Sadly, no luck. Still about the same rate of overflows :( >> >> > FYI, I worked around the issue by using two 520-CR NICs instead of the > one 540-CR NIC and performance is solid again with no dropped packets > >

Another configuration point on this. Moving to RELENG_13 brings different defaults with respect to power/performance trade-offs for my motherboard and CPU (SuperMicro X11SCH-F, Xeon(R) E-2226G). On RELENG_13 the hwpstate_intel driver attaches by default and scales the CPU frequency up and down. My guess is that, given the somewhat bursty nature of the load, a CPU that had clocked down to 800MHz could not clock back up fast enough to absorb a sudden jump in traffic from, say, 300Mb/s to 800Mb/s, so some packets would overflow the NIC's buffers. Printing the CPU frequency once per second, it was constantly floating between roughly 900MHz and 4300MHz. At first I couldn't get my head around the fact that the most dropped packets happened during the lowest-pps periods. Once I started graphing CPU frequency, CPU temperature, pps and Mb/s, the pattern really stood out. Sure enough, setting dev.hwpstate_intel.0.epp=0 on the cores, down from the default of 50 (see hwpstate_intel(4)), made the difference.

# sysctl -a dev.cpufreq.0.freq_driver
dev.cpufreq.0.freq_driver: hwpstate_intel0
#

    ---Mike
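For anyone hitting the same interaction, a short sketch of how this workaround could be applied to every core and checked is below. The 0-5 core range and the idea of persisting the setting via /etc/sysctl.conf are assumptions on top of what the thread states, not details given by the poster; hwpstate_intel(4) documents the epp knob itself.

#!/bin/sh
# Bias all cores toward performance: EPP 0 instead of the default 50.
# Core numbering 0-5 is assumed for this 6-core E-2226G.
for c in 0 1 2 3 4 5; do
    sysctl dev.hwpstate_intel.${c}.epp=0
done

# The same assignments can be added to /etc/sysctl.conf so they survive
# a reboot. To watch the effect, print the core 0 clock once per second:
#   while :; do sysctl -n dev.cpu.0.freq; sleep 1; done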