From owner-freebsd-fs@freebsd.org Sun May 15 10:48:47 2016
From: "Niall Douglas" <s_sourceforge@nedprod.com>
To: "freebsd-fs@FreeBSD.org"
Date: Sun, 15 May 2016 11:45:42 +0100
Subject: Re: State of native encryption in ZFS
Message-ID: <57385356.4525.E728971@s_sourceforge.nedprod.com>
References: <5736E7B4.1000409@gmail.com>, <57378707.19425.B54772B@s_sourceforge.nedprod.com>

On 14 May 2016 at 16:09, K. Macy wrote:

> >> It's not even clear how that encryption would be implemented or exposed.
> >> Per pool? Per dataset? Per folder? Per file? There have been
> >> requests for all of the above at one time or another, and the key
> >> management challenges for each are different. They can also be
> >> implemented at a layer above ZFS, given sufficient interest.
> >
> > If FreeBSD had a bigger PATH_MAX then stackable encryption layers
> > like ecryptfs (encfs?) would be viable choices. Because encrypted
> > path components are so long, one runs very rapidly into the maximum
> > path on the system when PATH_MAX is so low.
> >
> > I ended up actually installing ZFS on Linux with ecryptfs on top to
> > solve this. Every 15 minutes it ZFS-snapshot-syncs with the FreeBSD
> > edition. This works very well, apart from the poor performance of
> > ZFS on Linux.
> >
> > ZFS handles long paths with ease. FreeBSD currently does not :(
>
> AFAICT that's a 1 line patch. Have you tried patching that and
> rebuilding kernel, world, and any vulnerable ports?
The problem is apparently kernel structure bloat, and that they want to
remove fixed maximum paths altogether so the limit would be boot-time
modifiable:

http://freebsd.1045724.n5.nabble.com/misc-184340-PATH-MAX-not-interoperable-with-Linux-td5864469.html

As laudable as the latter goal is, unfortunately OS X and Linux hard
code theirs, and much POSIX software will use whatever PATH_MAX is set
to. I'm therefore not sure the implementation cost is worth it.

In any case, a 1024 byte path limit is potentially just 256 Unicode
characters. That's worse than Windows 95 :(

Niall

-- 
ned Productions Limited Consulting
http://www.nedproductions.biz/
http://ie.linkedin.com/in/nialldouglas/
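[To put numbers on the encrypted-component problem, here is a
back-of-the-envelope sketch in C. The header and padding overheads are
assumed eCryptfs-style figures for illustration, not measured values.]

#include <stdio.h>
#include <limits.h>     /* PATH_MAX, NAME_MAX */

int
main(void)
{
        int plain = 40;                                 /* assumed typical component length */
        int header = 24;                                /* assumed marker + metadata bytes */
        int padded = ((plain + 15) / 16) * 16;          /* pad to a 16-byte cipher block */
        int encoded = ((header + padded + 2) / 3) * 4;  /* base64 grows data by 4/3 */

        printf("PATH_MAX=%d NAME_MAX=%d\n", PATH_MAX, NAME_MAX);
        printf("a %d-byte name becomes ~%d bytes once encrypted\n",
            plain, encoded);
        printf("maximum depth falls from ~%d to ~%d components\n",
            PATH_MAX / (plain + 1), PATH_MAX / (encoded + 1));
        return (0);
}

[With these assumptions a 40-byte name grows to roughly 96 bytes, so a
1024-byte PATH_MAX that allowed ~24 components of plaintext allows only
~10 encrypted ones.]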
From owner-freebsd-fs@freebsd.org Sun May 15 13:42:51 2016
From: Andriy Gapon <avg@FreeBSD.org>
To: freebsd-arch@FreeBSD.org, freebsd-fs
Subject: mount / unmount and mountcheckdirs()
Message-ID: <5c01bf62-b7b2-2e1d-bca5-859e6bf1f0e5@FreeBSD.org>
Date: Sun, 15 May 2016 16:37:05 +0300
I am curious about the purpose of mountcheckdirs() called when mounting
and unmounting a filesystem.

The function is described as such:

/*
 * Scan all active processes and prisons to see if any of them have a current
 * or root directory of `olddp'. If so, replace them with the new mount point.
 */

and it seems to be used to "lift" processes and jails to the root of a
new filesystem when it is mounted and to "lower" them onto a covered
vnode (if any) when a filesystem is unmounted.

What's the purpose of those actions?
It's strange that the machinations are done at all, but it is stranger
still that they are applied only to processes and jails at exactly a
covered vnode and a root vnode. Anything below in a filesystem's tree
is left alone. Is there anything so very special about being at exactly
those points?

IMO, the machinations can have unexpected security consequences.

A little bit of history. mountcheckdirs() was first added in r22521
(circa 1997) as checkdirs() with a rather non-specific commit message.
Initially it was used only when a filesystem was mounted. Then, in
r73241 (circa 2002), the function was added to dounmount():

    The checkdirs() function is called at mount time to find any
    process fd_cdir or fd_rdir pointers referencing the covered
    mountpoint vnode. It transfers these to point at the root of the
    new filesystem. However, this process was not reversed at unmount
    time, so processes with a cwd/root at a mount point would
    unexpectedly lose their cwd/root following a mount-unmount cycle
    at that mountpoint.
    ...
    Dounmount() now undoes the actions taken by checkdirs() at mount
    time; any process cdir/rdir pointers that reference the root vnode
    of the unmounted filesystem are transferred to the now-uncovered
    vnode.

-- 
Andriy Gapon
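[The "lift/lower" behaviour the comment describes, distilled into a
minimal standalone C sketch. This is an illustration only: locking,
prisons, and vnode reference counting are omitted, and it is not the
actual kernel code.]

struct vnode;                           /* opaque for this sketch */

struct dirs_sketch {
        struct vnode *cdir;             /* a process's current directory */
        struct vnode *rdir;             /* a process's root directory */
};

/*
 * On mount, olddp is the covered vnode and newdp the new filesystem's
 * root; on unmount the roles are reversed.
 */
static void
checkdirs_sketch(struct dirs_sketch *p, int nprocs,
    struct vnode *olddp, struct vnode *newdp)
{
        for (int i = 0; i < nprocs; i++) {
                /*
                 * Only a cwd/root sitting exactly on olddp is moved;
                 * anything deeper in the tree is left alone, which is
                 * the asymmetry questioned above.
                 */
                if (p[i].cdir == olddp)
                        p[i].cdir = newdp;      /* vref()/vrele() elided */
                if (p[i].rdir == olddp)
                        p[i].rdir = newdp;
        }
}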
From owner-freebsd-fs@freebsd.org Sun May 15 16:53:37 2016
From: Mateusz Guzik <mjguzik@gmail.com>
To: Andriy Gapon <avg@FreeBSD.org>
Cc: freebsd-arch@FreeBSD.org, freebsd-fs
Date: Sun, 15 May 2016 18:53:32 +0200
Subject: Re: mount / unmount and mountcheckdirs()
Message-ID: <20160515165332.GA27836@dft-labs.eu>
In-Reply-To: <5c01bf62-b7b2-2e1d-bca5-859e6bf1f0e5@FreeBSD.org>

On Sun, May 15, 2016 at 04:37:05PM +0300, Andriy Gapon wrote:
>
> I am curious about the purpose of mountcheckdirs() called when
> mounting and unmounting a filesystem.
>
> [...]
>
> What's the purpose of those actions?
> It's strange that the machinations are done at all, but it is
> stranger still that they are applied only to processes and jails at
> exactly a covered vnode and a root vnode. Anything below in a
> filesystem's tree is left alone. Is there anything so very special
> about being at exactly those points?
>
> IMO, the machinations can have unexpected security consequences.
>

I don't know why this was implemented. It is also done in NetBSD. It is
not done in Solaris nor Linux.

The replacement is buggy in at least 2 ways:

1. the process vs jail vnode replacement leaves a time window where
   these 2 don't match, which screws up the lookup

2. on fork we can have a 'struct filedesc' object copied but not yet
   assigned to the new process, so it ends up with the old vnode

And indeed, interested parties still have access to old vnodes by means
of having a file descriptor.

That said, this likely needs to be simply changed to /deny/ mount
operations which would alter jail roots.

-- 
Mateusz Guzik
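[The first bug is easy to picture: the process pointer and the jail
pointer are swapped in two separate steps. A minimal illustration, with
invented names and an assumed simplified layout, not kernel code:]

struct vnode;

struct jail_sketch { struct vnode *root; };

struct proc_sketch {
        struct vnode *cdir;
        struct jail_sketch *jail;
};

static void
replace_roots_racy(struct proc_sketch *p, struct vnode *olddp,
    struct vnode *newdp)
{
        if (p->cdir == olddp)
                p->cdir = newdp;        /* step 1: process updated */
        /*
         * A concurrent lookup here sees p->cdir == newdp while
         * p->jail->root is still olddp: the two no longer match.
         */
        if (p->jail->root == olddp)
                p->jail->root = newdp;  /* step 2: jail updated later */
}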
From owner-freebsd-fs@freebsd.org Sun May 15 21:00:05 2016
From: bugzilla-noreply@FreeBSD.org
To: freebsd-fs@FreeBSD.org
Subject: Problem reports for freebsd-fs@FreeBSD.org that need special attention
Date: Sun, 15 May 2016 21:00:04 +0000

To view an individual PR, use:
  https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=(Bug Id).

The following is a listing of current problems submitted by FreeBSD
users, which need special attention. These represent problem reports
covering all versions including experimental development code and
obsolete releases.

Status      | Bug Id    | Description
------------+-----------+---------------------------------------------------
New         | 203492    | mount_unionfs -o below causes panic
Open        | 136470    | [nfs] Cannot mount / in read-only, over NFS
Open        | 139651    | [nfs] mount(8): read-only remount of NFS volume d
Open        | 144447    | [zfs] sharenfs fsunshare() & fsshare_main() non f

4 problems total for which you should take action.
From owner-freebsd-fs@freebsd.org Mon May 16 05:02:37 2016
From: "K. Macy" <kmacybsd@gmail.com>
To: Hans Petter Selasky, "freebsd-fs@FreeBSD.org"
Date: Sun, 15 May 2016 22:02:36 -0700
Subject: bug in umass?

I'm not able to complete a coredump in i915 to a USB key. The backtrace
in the log looks like a bug in umass.
May 15 21:57:10 beastie kernel: cmap[0]=0 cmap[1]=7f0000 cmap[2]=7f00 cmap[3]=c4a000
May 15 21:57:10 beastie kernel: end FB_INFO
May 15 21:57:10 beastie kernel: drmn0: fb0: inteldrmfb frame buffer device
May 15 21:57:10 beastie kernel: ..3%
May 15 21:58:16 beastie syslogd: kernel boot file is /boot/kernel/kernel
May 15 21:58:16 beastie kernel: trap_fatal() at trap_fatal+0x2d/frame 0xfffffe01e2fd6350
May 15 21:58:16 beastie kernel: trap() at trap+0xc48/frame 0xfffffe01e2fd6690
May 15 21:58:16 beastie kernel: trap_check() at trap_check+0x4a/frame 0xfffffe01e2fd66b0
May 15 21:58:16 beastie kernel: calltrap() at calltrap+0x8/frame 0xfffffe01e2fd66b0
May 15 21:58:16 beastie kernel: --- trap 0x9, rip = 0xffffffff80f5a950, rsp = 0xfffffe01e2fd6780, rbp = 0xfffffe01e2fd6810 ---
May 15 21:58:16 beastie kernel: __mtx_lock_flags() at __mtx_lock_flags+0xd0/frame 0xfffffe01e2fd6810
May 15 21:58:16 beastie kernel: xpt_done_process() at xpt_done_process+0x495/frame 0xfffffe01e2fd68c0
May 15 21:58:16 beastie kernel: xpt_done_td() at xpt_done_td+0x1c0/frame 0xfffffe01e2fd6930
May 15 21:58:16 beastie kernel: fork_exit() at fork_exit+0x13b/frame 0xfffffe01e2fd69b0
May 15 21:58:16 beastie kernel: fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe01e2fd69b0
May 15 21:58:16 beastie kernel: --- trap 0, rip = 0, rsp = 0, rbp = 0 ---

From owner-freebsd-fs@freebsd.org Mon May 16 06:24:43 2016
From: bugzilla-noreply@freebsd.org
To: freebsd-fs@FreeBSD.org
Subject: [Bug 207464] Panic when destroying ZFS snapshot
Date: Mon, 16 May 2016 06:24:41 +0000

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=207464

--- Comment #24 from commit-hook@freebsd.org ---
A commit references this bug:

Author: avg
Date: Mon May 16 06:24:05 UTC 2016
New revision: 299900
URL: https://svnweb.freebsd.org/changeset/base/299900

Log:
  zfsctl: fix several problems with reference counts

  * Remove excessive references on a snapshot mountpoint vnode.
    zfsctl_snapdir_lookup() called VN_HOLD() on a vnode returned from
    zfsctl_snapshot_mknode(), and the latter also had a call to
    VN_HOLD() on the same vnode. On top of that, gfs_dir_create()
    already returns the vnode with a use count of 1 (set in
    getnewvnode). So there were three references on the vnode.
  * mount_snapshot() should keep a reference to a covered vnode.
    That reference is owned by the mountpoint (the mounted snapshot
    filesystem).
  * Remove cryptic manipulations of a covered vnode in zfs_umount().
    FreeBSD's dounmount() already does the right thing and releases
    the covered vnode.

  PR:           207464
  Reported by:  dustinwenz@ebureau.com
  Tested by:    Howard Powell
  MFC after:    3 weeks

Changes:
  head/sys/cddl/compat/opensolaris/kern/opensolaris_vfs.c
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c
  head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c

-- 
You are receiving this mail because:
You are on the CC list for the bug.

From owner-freebsd-fs@freebsd.org Mon May 16 06:25:11 2016
From: bugzilla-noreply@freebsd.org
To: freebsd-fs@FreeBSD.org
Subject: [Bug 207464] Panic when destroying ZFS snapshot
Date: Mon, 16 May 2016 06:25:10 +0000

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=207464

Andriy Gapon changed:

           What    |Removed     |Added
--------------------------------------------------
           Status  |Open        |In Progress

-- 
You are receiving this mail because:
You are on the CC list for the bug.
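[The commit log above is essentially reference-count bookkeeping. A toy
model of the leak it describes, with VN_HOLD()-style counting reduced
to a plain integer and all names invented for illustration:]

struct vn_sketch { int usecount; };

static void
vhold_sketch(struct vn_sketch *vp)
{
        vp->usecount++;
}

static void
dir_create_sketch(struct vn_sketch *vp)
{
        vp->usecount = 1;       /* getnewvnode() hands the vnode back held once */
}

static void
mknode_sketch(struct vn_sketch *vp)
{
        dir_create_sketch(vp);
        vhold_sketch(vp);       /* second reference: the first excess one */
}

static void
lookup_sketch(struct vn_sketch *vp)
{
        mknode_sketch(vp);
        vhold_sketch(vp);       /* third reference: usecount is now 3, but the
                                   caller only expects, and later drops, one */
}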
From owner-freebsd-fs@freebsd.org Mon May 16 07:03:20 2016
From: Edward Tomasz Napierała <etnapierala@gmail.com>
To: Andriy Gapon <avg@FreeBSD.org>
Cc: freebsd-arch@FreeBSD.org, freebsd-fs
Date: Mon, 16 May 2016 09:03:14 +0200
Subject: Re: mount / unmount and mountcheckdirs()
Message-ID: <20160516070314.GA3029@brick>
In-Reply-To: <5c01bf62-b7b2-2e1d-bca5-859e6bf1f0e5@FreeBSD.org>

On 0515T1637, Andriy Gapon wrote:
>
> I am curious about the purpose of mountcheckdirs() called when
> mounting and unmounting a filesystem.

[..]

Whatever you do, please make sure you don't break autofs, and reroot,
esp. firmware(9) loading after reroot. I'll happily test patches, just
mail them to me.
Thanks :-)

From owner-freebsd-fs@freebsd.org Mon May 16 07:37:45 2016
From: bugzilla-noreply@freebsd.org
To: freebsd-fs@FreeBSD.org
Subject: [Bug 207464] Panic when destroying ZFS snapshot
Date: Mon, 16 May 2016 07:37:45 +0000

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=207464

--- Comment #25 from Andriy Gapon ---
Created attachment 170343
  --> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=170343&action=edit
add-on patch

If you are testing the first patch, could you please test this patch on
top of the first patch as well?
-- 
You are receiving this mail because:
You are on the CC list for the bug.

From owner-freebsd-fs@freebsd.org Mon May 16 07:43:49 2016
From: Andriy Gapon <avg@FreeBSD.org>
To: freebsd-arch@FreeBSD.org, freebsd-fs
Subject: Re: mount / unmount and mountcheckdirs()
Date: Mon, 16 May 2016 10:43:08 +0300
In-Reply-To: <20160516070314.GA3029@brick>

On 16/05/2016 10:03, Edward Tomasz Napierała wrote:
> On 0515T1637, Andriy Gapon wrote:
>>
>> I am curious about the purpose of mountcheckdirs() called when
>> mounting and unmounting a filesystem.
>
> [..]
>
> Whatever you do, please make sure you don't break autofs, and reroot,
> esp. firmware(9) loading after reroot. I'll happily test patches,
> just mail them to me. Thanks :-)
>

Well, the only patch I had in mind (besides
https://svnweb.freebsd.org/changeset/base/299913) is completely
removing mountcheckdirs(). But now that you mentioned autofs and
reroot, I am not sure that it could be that simple...
-- 
Andriy Gapon

From owner-freebsd-fs@freebsd.org Mon May 16 07:47:40 2016
From: bugzilla-noreply@freebsd.org
To: freebsd-fs@FreeBSD.org
Subject: [Bug 207464] Panic when destroying ZFS snapshot
Date: Mon, 16 May 2016 07:47:40 +0000

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=207464

Xin LI changed:

           What    |Removed     |Added
--------------------------------------------------
           CC      |            |delphij@FreeBSD.org,
                   |            |re@FreeBSD.org

--- Comment #26 from Xin LI ---
EN candidate?

-- 
You are receiving this mail because:
You are on the CC list for the bug.

From owner-freebsd-fs@freebsd.org Mon May 16 07:47:44 2016
From: Hans Petter Selasky <hps@selasky.org>
To: "K. Macy", "freebsd-fs@FreeBSD.org", Alexander Motin
Date: Mon, 16 May 2016 09:51:01 +0200
Subject: Re: bug in umass?

Hi Alexander,

Does dumping core on a USB stick from KDB require any threads?

From the USB point of view we are doing polling in "umass_cam_poll()".

--HPS

On 05/16/16 07:02, K. Macy wrote:
> I'm not able to complete a coredump in i915 to a USB key. The
> backtrace in the log looks like a bug in umass.
>
> [...]
From owner-freebsd-fs@freebsd.org Mon May 16 08:43:48 2016
From: Alexander Motin
To: Hans Petter Selasky, "K. Macy", "freebsd-fs@FreeBSD.org"
Date: Mon, 16 May 2016 11:43:44 +0300
Subject: Re: bug in umass?
Message-ID: <57398840.6010700@FreeBSD.org>

On 16.05.16 10:51, Hans Petter Selasky wrote:
> Hi Alexander,
>
> Does dumping core on a USB stick from KDB require any threads?
>
> From the USB point of view we are doing polling in "umass_cam_poll()".

CAM does not differentiate USB from others. Kernel dumping completely
bypasses GEOM and its threads, manually pushes CAM queues without
requiring context switches, does polling for CAM HBA drivers via the
respective method call, and processes completion queues without
depending on completion threads.

> On 05/16/16 07:02, K. Macy wrote:
>> I'm not able to complete a coredump in i915 to a USB key. The
>> backtrace in the log looks like a bug in umass.
>>
>> [...]

-- 
Alexander Motin
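[In other words, the dump path is fully polled. A schematic sketch of
that pattern, an assumed simplification with invented names rather than
the real CAM interfaces:]

struct hba_sketch {
        void (*start)(struct hba_sketch *, void *buf, int len);
        void (*poll)(struct hba_sketch *);      /* a umass_cam_poll()-like hook */
        volatile int done;                      /* set by completion processing */
};

/*
 * Write one chunk of the dump with no GEOM and no completion threads:
 * start the I/O by hand, then spin on the driver's poll method until
 * completion processing marks the request done.
 */
static void
polled_dump_write(struct hba_sketch *hba, void *buf, int len)
{
        hba->done = 0;
        hba->start(hba, buf, len);
        while (!hba->done)
                hba->poll(hba);
}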
From owner-freebsd-fs@freebsd.org Mon May 16 10:15:15 2016
From: Palle Girgensohn <girgen@FreeBSD.org>
To: freebsd-fs@freebsd.org
Date: Mon, 16 May 2016 12:08:38 +0200
Subject: Best practice for high availability ZFS pool
Message-Id: <5E69742D-D2E0-437F-B4A9-A71508C370F9@FreeBSD.org>

Hi,

We need to set up a ZFS pool with redundancy. The main goal is high
availability - uptime.

I can see a few paths to follow.

1. HAST + ZFS
2. Some sort of shared storage, two machines sharing a JBOD box.
3. ZFS replication (zfs snapshot + zfs send | ssh | zfs receive)
4. using something else than ZFS, even a different OS if required.

My main concern with HAST+ZFS is performance. Google offers some
insights here; I find mainly unsolved problems.
Please share any success stories or other experiences.

Shared storage still has a single point of failure, the JBOD box. Apart
from that, is there even any support for the kind of storage PCI cards
that support dual head for a storage box? I cannot find any.

We are running with ZFS replication today, but it is just too slow for
the amount of data.

We prefer to keep ZFS, as we already have a rather big (~30 TB) pool
and also tools, scripts and backup all using ZFS, but if there is no
solution using ZFS, we're open to alternatives. Nexenta springs to
mind, but I believe it is using shared storage for redundancy, so it
does have single points of failure?

Any other suggestions? Please share your experience. :)

Palle

From owner-freebsd-fs@freebsd.org Mon May 16 13:18:30 2016
From: Willem Jan Withagen <wjw@digiware.nl>
To: Niall Douglas, "freebsd-fs@FreeBSD.org"
Date: Mon, 16 May 2016 15:18:17 +0200
Subject: Bigger MAX_PATH (Was: Re: State of native encryption in ZFS)
Message-ID: <9ead4b28-9711-5e38-483f-ef9eaf0bc583@digiware.nl>
In-Reply-To: <57385356.4525.E728971@s_sourceforge.nedprod.com>
List-Id: Filesystems
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Mon, 16 May 2016 13:18:30 -0000

On 15-5-2016 12:45, Niall Douglas via freebsd-fs wrote:
> On 14 May 2016 at 16:09, K. Macy wrote:
>
>>>> It’s not even clear how that encryption would be implemented or exposed.
>>>> Per pool? Per dataset? Per folder? Per file? There have been
>>>> requests for all of the above at one time or another, and the key
>>>> management challenges for each are different. They can also be
>>>> implemented at a layer above ZFS, given sufficient interest.
>>>
>>> If FreeBSD had a bigger PATH_MAX then stackable encryptions layers
>>> like ecryptfs (encfs?) would be viable choices. Because encrypted
>>> path components are so long, one runs very rapidly into the maximum
>>> path on the system when PATH_MAX is so low.
>>>
>>> I ended up actually installing ZFS on Linux with ecryptfs on top to
>>> solve this. Every 15 minutes it ZFS snapshot syncs with the FreeBSD
>>> edition. This works very well, apart from the poor performance of ZFS
>>> on Linux.
>>>
>>> ZFS handles long paths with ease. FreeBSD currently does not :(
>>
>> AFAICT that's a 1 line patch. Have you tried patching that and
>> rebuilding kernel, world, and any vulnerable ports?
>
> The problem is apparently kernel structure bloat and that they want
> to remove fixed maximum paths altogether so it would be boot
> modifiable.
>
> http://freebsd.1045724.n5.nabble.com/misc-184340-PATH-MAX-not-interope
> rable-with-Linux-td5864469.html
>
> As laudable as the latter goal is, unfortunately OS X and Linux hard
> code theirs, and much POSIX software will use whatever PATH_MAX is
> set to. I'm therefore not sure the implementation cost is worth it.
>
> In any case, a 1024 byte path limit is just 256 unicode characters
> potentially. That's worse than Windows 95 :(

I'm pretty sure that just about everybody that runs a somewhat bigger
ZFS installation runs into this at one point or another. The weekly
locate database build has nagged me (after every fresh install) for
about 4 years already that it needs a larger path than 1024. And then
I just dig into the source to up the value; the locate.db does not
really care.

I think I got a reply from Jilles around that time that changing the
defines might cause unwanted compatibility fallout. That was answer
enough to keep my hands from just doing the 1-line patch.

Trying to port Ceph is also running into the limit in:
/usr/include/sys/syslimits.h:
#define NAME_MAX 255 /* max bytes in a file name */

but I also found:
/usr/include/stdio.h:
#define FILENAME_MAX 1024 /* must be <= PATH_MAX */

So take a pick??
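FWIW, the effective values are easy to compare on a live system, since
getconf can query both limits per file system at run time, e.g.:

    getconf NAME_MAX /usr
    getconf PATH_MAX /usr

(/usr is just an arbitrary mount point here.)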
--WjW

From owner-freebsd-fs@freebsd.org Mon May 16 13:56:38 2016
Return-Path:
Delivered-To: freebsd-fs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 781DAB3DF25 for ; Mon, 16 May 2016 13:56:38 +0000 (UTC) (envelope-from borjam@sarenet.es)
Received: from cu01176a.smtpx.saremail.com (cu01176a.smtpx.saremail.com [195.16.150.151]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 36C3A135D; Mon, 16 May 2016 13:56:37 +0000 (UTC) (envelope-from borjam@sarenet.es)
Received: from [172.16.8.36] (izaro.sarenet.es [192.148.167.11]) by proxypop03.sare.net (Postfix) with ESMTPSA id 35EB89DD37C; Mon, 16 May 2016 15:51:03 +0200 (CEST)
Content-Type: text/plain; charset=utf-8
Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\))
Subject: Re: Best practice for high availability ZFS pool
From: Borja Marcos
In-Reply-To: <5E69742D-D2E0-437F-B4A9-A71508C370F9@FreeBSD.org>
Date: Mon, 16 May 2016 15:51:02 +0200
Cc: freebsd-fs@freebsd.org
Content-Transfer-Encoding: quoted-printable
Message-Id:
References: <5E69742D-D2E0-437F-B4A9-A71508C370F9@FreeBSD.org>
To: Palle Girgensohn
X-Mailer: Apple Mail (2.3124)
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: Filesystems
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Mon, 16 May 2016 13:56:38 -0000

> On 16 May 2016, at 12:08, Palle Girgensohn wrote:
>
> Hi,
>
> We need to set up a ZFS pool with redundancy. The main goal is high availability - uptime.
>
> I can see a few paths to follow.
>
> 1. HAST + ZFS

Which means that a possible corruption-causing bug in ZFS would vaporize the data of both replicas.

> 3. ZFS replication (zfs snapshot + zfs send | ssh | zfs receive)

If you don’t have a hard requirement for synchronous replication (and, in that case, I would opt for a more application-aware approach) it’s the best method in my opinion.

Borja.
From owner-freebsd-fs@freebsd.org Mon May 16 14:52:37 2016
Return-Path:
Delivered-To: freebsd-fs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id AE6F6B3D121 for ; Mon, 16 May 2016 14:52:37 +0000 (UTC) (envelope-from rainer@ultra-secure.de)
Received: from connect.ultra-secure.de (connect.ultra-secure.de [88.198.71.201]) by mx1.freebsd.org (Postfix) with ESMTP id CA45419F7; Mon, 16 May 2016 14:52:36 +0000 (UTC) (envelope-from rainer@ultra-secure.de)
Received: (Haraka outbound); Mon, 16 May 2016 16:52:29 +0200
Authentication-Results: connect.ultra-secure.de; iprev=pass; auth=pass (plain); spf=none smtp.mailfrom=ultra-secure.de
Received-SPF: None (connect.ultra-secure.de: domain of ultra-secure.de does not designate 217.71.83.52 as permitted sender) receiver=connect.ultra-secure.de; identity=mailfrom; client-ip=217.71.83.52; helo=[192.168.1.200]; envelope-from=
Received: from [192.168.1.200] (217-071-083-052.ip-tech.ch [217.71.83.52]) by connect.ultra-secure.de (Haraka/2.6.2-toaster) with ESMTPSA id 9062EEE8-5B4C-48E7-B021-F8137F8512A3.1 envelope-from (authenticated bits=0) (version=TLSv1/SSLv3 cipher=AES256-SHA verify=NO); Mon, 16 May 2016 16:52:26 +0200
Content-Type: text/plain; charset=utf-8
Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\))
Subject: Re: Best practice for high availability ZFS pool
From: Rainer Duffner
In-Reply-To: <5E69742D-D2E0-437F-B4A9-A71508C370F9@FreeBSD.org>
Date: Mon, 16 May 2016 16:52:24 +0200
Cc: freebsd-fs@freebsd.org
Content-Transfer-Encoding: quoted-printable
Message-Id: <284D58D1-1C62-4519-A46B-7D0E8326B86B@ultra-secure.de>
References: <5E69742D-D2E0-437F-B4A9-A71508C370F9@FreeBSD.org>
To: Palle Girgensohn
X-Mailer: Apple Mail (2.3124)
X-Haraka-GeoIP: EU, CH, 451km
X-Haraka-ASN: 24951
X-Haraka-GeoIP-Received:
X-Haraka-ASN: 24951 217.71.80.0/20
X-Haraka-ASN-CYMRU: asn=24951 net=217.71.80.0/20 country=CH assignor=ripencc date=2003-08-07
X-Haraka-FCrDNS: 217-071-083-052.ip-tech.ch
X-Haraka-p0f: os="Mac OS X " link_type="DSL" distance=13 total_conn=1 shared_ip=N
X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on spamassassin
X-Spam-Level:
X-Spam-Status: No, score=-2.9 required=5.0 tests=ALL_TRUSTED,BAYES_00 autolearn=ham autolearn_force=no version=3.4.1
X-Haraka-Karma: score: 6, good: 166, bad: 0, connections: 326, history: 166, asn_score: 100, asn_connections: 111, asn_good: 100, asn_bad: 0, pass:all_good, asn, asn_all_good, relaying
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: Filesystems
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Mon, 16 May 2016 14:52:37 -0000

> On 16.05.2016 at 12:08, Palle Girgensohn wrote:
>
> Hi,
>
> We need to set up a ZFS pool with redundancy. The main goal is high availability - uptime.
>
> I can see a few paths to follow.
>
> 1. HAST + ZFS
>
> 2. Some sort of shared storage, two machines sharing a JBOD box.
>
> 3. ZFS replication (zfs snapshot + zfs send | ssh | zfs receive)
>
> 4. Using something other than ZFS, even a different OS if required.

There’s always GlusterFS.
Recently ported to FreeBSD and available as net/glusterfs (10.3 recommended, AFAIK).

At work, we use it on Ubuntu - but not with so much data.
On Linux, I’d use it on top of XFS.

For our Cloud-Storage, we went with ScaleIO (which is Linux only).
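Going back to Gluster for a moment: for reference, a replicated volume is created roughly like this (host names and brick paths are placeholders, and the two-node layout is only an illustration):

    gluster volume create vol0 replica 2 host1:/data/brick0 host2:/data/brick0
    gluster volume start vol0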
You need more than two nodes with Gluster, though (for production use); I think my co-worker said at least four.

If you have the money and don’t mind Linux, ScaleIO is probably the best you can buy at the moment.
While licensed at the GByte level (yeah, EMC…) it can be used free of charge, unsupported.

From owner-freebsd-fs@freebsd.org Mon May 16 15:38:20 2016
Return-Path:
Delivered-To: freebsd-fs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id C9FCAB3DF5D for ; Mon, 16 May 2016 15:38:20 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org)
Received: from kenobi.freebsd.org (kenobi.freebsd.org [IPv6:2001:1900:2254:206a::16:76]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id BA8E41BCD for ; Mon, 16 May 2016 15:38:20 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org)
Received: from bugs.freebsd.org ([127.0.1.118]) by kenobi.freebsd.org (8.15.2/8.15.2) with ESMTP id u4GFcKBB050055 for ; Mon, 16 May 2016 15:38:20 GMT (envelope-from bugzilla-noreply@freebsd.org)
From: bugzilla-noreply@freebsd.org
To: freebsd-fs@FreeBSD.org
Subject: [Bug 209093] ZFS snapshot rename : .zfs/snapshot messes up
Date: Mon, 16 May 2016 15:38:20 +0000
X-Bugzilla-Reason: AssignedTo
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: Base System
X-Bugzilla-Component: kern
X-Bugzilla-Version: 10.3-RELEASE
X-Bugzilla-Keywords:
X-Bugzilla-Severity: Affects Some People
X-Bugzilla-Who: commit-hook@freebsd.org
X-Bugzilla-Status: New
X-Bugzilla-Resolution:
X-Bugzilla-Priority: ---
X-Bugzilla-Assigned-To: freebsd-fs@FreeBSD.org
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields:
Message-ID:
In-Reply-To:
References:
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: https://bugs.freebsd.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: Filesystems
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Mon, 16 May 2016 15:38:20 -0000

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209093

--- Comment #1 from commit-hook@freebsd.org ---
A commit references this bug:

Author: avg
Date: Mon May 16 15:37:41 UTC 2016
New revision: 299949
URL: https://svnweb.freebsd.org/changeset/base/299949

Log:
try to recycle "snap" vnodes as soon as possible

Those vnodes should not linger. "Stale" nodes may get out of
synchronization with actual snapshots. For example if we destroy
a snapshot and create a new one with the same name. Or when we
rename a snapshot.

While there, fix the argument type for zfsctl_snapshot_reclaim().
Also, its original argument can be passed to gfs_vop_reclaim()
directly.

Bug 209093 could be related although I have not specifically
verified that. Referencing just in case.
PR: 209093
MFC after: 5 weeks
Changes:
head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ctldir.c

--
You are receiving this mail because:
You are the assignee for the bug.

From owner-freebsd-fs@freebsd.org Mon May 16 16:32:52 2016
Return-Path:
Delivered-To: freebsd-fs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id A8DAFB3D2E4 for ; Mon, 16 May 2016 16:32:52 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org)
Received: from kenobi.freebsd.org (kenobi.freebsd.org [IPv6:2001:1900:2254:206a::16:76]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 99A1E1ACD for ; Mon, 16 May 2016 16:32:52 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org)
Received: from bugs.freebsd.org ([127.0.1.118]) by kenobi.freebsd.org (8.15.2/8.15.2) with ESMTP id u4GGWqnn004439 for ; Mon, 16 May 2016 16:32:52 GMT (envelope-from bugzilla-noreply@freebsd.org)
From: bugzilla-noreply@freebsd.org
To: freebsd-fs@FreeBSD.org
Subject: [Bug 209093] ZFS snapshot rename : .zfs/snapshot messes up
Date: Mon, 16 May 2016 16:32:52 +0000
X-Bugzilla-Reason: AssignedTo
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: Base System
X-Bugzilla-Component: kern
X-Bugzilla-Version: 10.3-RELEASE
X-Bugzilla-Keywords:
X-Bugzilla-Severity: Affects Some People
X-Bugzilla-Who: avg@FreeBSD.org
X-Bugzilla-Status: New
X-Bugzilla-Resolution:
X-Bugzilla-Priority: ---
X-Bugzilla-Assigned-To: freebsd-fs@FreeBSD.org
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields: attachments.created
Message-ID:
In-Reply-To:
References:
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: https://bugs.freebsd.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: Filesystems
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Mon, 16 May 2016 16:32:52 -0000

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209093

--- Comment #2 from Andriy Gapon ---
Created attachment 170370
--> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=170370&action=edit
proposed patch for testing

I think I've found the cause of this problem.
I can't believe it but it seems that the 'allow_mounted' check was reversed for 3 years since it was introduced in 2013.
--
You are receiving this mail because:
You are the assignee for the bug.

From owner-freebsd-fs@freebsd.org Mon May 16 16:37:07 2016
Return-Path:
Delivered-To: freebsd-fs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id EF46AB3D3AB for ; Mon, 16 May 2016 16:37:07 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org)
Received: from kenobi.freebsd.org (kenobi.freebsd.org [IPv6:2001:1900:2254:206a::16:76]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id E02F51BE8 for ; Mon, 16 May 2016 16:37:07 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org)
Received: from bugs.freebsd.org ([127.0.1.118]) by kenobi.freebsd.org (8.15.2/8.15.2) with ESMTP id u4GGb7H5010506 for ; Mon, 16 May 2016 16:37:07 GMT (envelope-from bugzilla-noreply@freebsd.org)
From: bugzilla-noreply@freebsd.org
To: freebsd-fs@FreeBSD.org
Subject: [Bug 209093] ZFS snapshot rename : .zfs/snapshot messes up
Date: Mon, 16 May 2016 16:37:08 +0000
X-Bugzilla-Reason: CC AssignedTo
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: Base System
X-Bugzilla-Component: kern
X-Bugzilla-Version: 10.3-RELEASE
X-Bugzilla-Keywords:
X-Bugzilla-Severity: Affects Some People
X-Bugzilla-Who: avg@FreeBSD.org
X-Bugzilla-Status: Open
X-Bugzilla-Resolution:
X-Bugzilla-Priority: ---
X-Bugzilla-Assigned-To: avg@FreeBSD.org
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields: cc assigned_to bug_status
Message-ID:
In-Reply-To:
References:
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: https://bugs.freebsd.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: Filesystems
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Mon, 16 May 2016 16:37:08 -0000

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209093

Andriy Gapon changed:

          What      |Removed                 |Added
----------------------------------------------------------------------------
                 CC |                        |freebsd-fs@FreeBSD.org
           Assignee |freebsd-fs@FreeBSD.org  |avg@FreeBSD.org
             Status |New                     |Open

--
You are receiving this mail because:
You are on the CC list for the bug.
You are the assignee for the bug.

From owner-freebsd-fs@freebsd.org Mon May 16 16:45:23 2016
Return-Path:
Delivered-To: freebsd-fs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id A75C5B3D760 for ; Mon, 16 May 2016 16:45:23 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org)
Received: from kenobi.freebsd.org (kenobi.freebsd.org [IPv6:2001:1900:2254:206a::16:76]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 96EED13A6 for ; Mon, 16 May 2016 16:45:23 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org)
Received: from bugs.freebsd.org ([127.0.1.118]) by kenobi.freebsd.org (8.15.2/8.15.2) with ESMTP id u4GGjNp5031824 for ; Mon, 16 May 2016 16:45:23 GMT (envelope-from bugzilla-noreply@freebsd.org)
From: bugzilla-noreply@freebsd.org
To: freebsd-fs@FreeBSD.org
Subject: [Bug 207464] Panic when destroying ZFS snapshot
Date: Mon, 16 May 2016 16:45:23 +0000
X-Bugzilla-Reason: CC
X-Bugzilla-Type: changed
X-Bugzilla-Watch-Reason: None
X-Bugzilla-Product: Base System
X-Bugzilla-Component: kern
X-Bugzilla-Version: 10.2-STABLE
X-Bugzilla-Keywords:
X-Bugzilla-Severity: Affects Many People
X-Bugzilla-Who: karl@denninger.net
X-Bugzilla-Status: In Progress
X-Bugzilla-Resolution:
X-Bugzilla-Priority: ---
X-Bugzilla-Assigned-To: avg@FreeBSD.org
X-Bugzilla-Flags:
X-Bugzilla-Changed-Fields:
Message-ID:
In-Reply-To:
References:
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
X-Bugzilla-URL: https://bugs.freebsd.org/bugzilla/
Auto-Submitted: auto-generated
MIME-Version: 1.0
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: Filesystems
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Mon, 16 May 2016 16:45:23 -0000

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=207464

--- Comment #27 from karl@denninger.net ---
Comment on attachment 170343
--> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=170343
add-on patch

Rebuilding kernel to include this as well....
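For anyone following along, the usual test cycle is roughly the
following, assuming the attachment is saved as /tmp/zfs.patch and
applies from /usr/src with the default -p level:

    cd /usr/src
    patch < /tmp/zfs.patch
    make -j8 buildkernel
    make installkernel
    shutdown -r now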
--
You are receiving this mail because:
You are on the CC list for the bug.

From owner-freebsd-fs@freebsd.org Mon May 16 19:22:33 2016
Return-Path:
Delivered-To: freebsd-fs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 30B36B3DE9B for ; Mon, 16 May 2016 19:22:33 +0000 (UTC) (envelope-from truckman@FreeBSD.org)
Received: from gw.catspoiler.org (unknown [IPv6:2602:304:b010:ef20::f2]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "gw.catspoiler.org", Issuer "gw.catspoiler.org" (not verified)) by mx1.freebsd.org (Postfix) with ESMTPS id 0025510B2 for ; Mon, 16 May 2016 19:22:32 +0000 (UTC) (envelope-from truckman@FreeBSD.org)
Received: from FreeBSD.org (mousie.catspoiler.org [192.168.101.2]) by gw.catspoiler.org (8.15.2/8.15.2) with ESMTP id u4GJMQNr072510 for ; Mon, 16 May 2016 12:22:30 -0700 (PDT) (envelope-from truckman@FreeBSD.org)
Message-Id: <201605161922.u4GJMQNr072510@gw.catspoiler.org>
Date: Mon, 16 May 2016 12:22:26 -0700 (PDT)
From: Don Lewis
Subject: patch to fix Coverity CIDs in rpc.statd find_host()
To: freebsd-fs@FreeBSD.org
MIME-Version: 1.0
Content-Type: TEXT/plain; charset=us-ascii
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: Filesystems
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Mon, 16 May 2016 19:22:33 -0000

Coverity barfed all over find_host() in rpc.statd. I put a patch up
for review here: . I'd like to get some other eyeballs on it before I
commit it.

From owner-freebsd-fs@freebsd.org Mon May 16 20:06:15 2016
Return-Path:
Delivered-To: freebsd-fs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 9C80EB3EF31 for ; Mon, 16 May 2016 20:06:15 +0000 (UTC) (envelope-from peter@rulingia.com)
Received: from vps.rulingia.com (vps.rulingia.com [103.243.244.15]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "rulingia.com", Issuer "Let's Encrypt Authority X3" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 492601AAD for ; Mon, 16 May 2016 20:06:14 +0000 (UTC) (envelope-from peter@rulingia.com)
Received: from server.rulingia.com (ppp59-167-167-3.static.internode.on.net [59.167.167.3]) by vps.rulingia.com (8.15.2/8.15.2) with ESMTPS id u4GK5nPV000925 (version=TLSv1.2 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Tue, 17 May 2016 06:05:56 +1000 (AEST) (envelope-from peter@rulingia.com)
X-Bogosity: Ham, spamicity=0.000000
Received: from server.rulingia.com (localhost.rulingia.com [127.0.0.1]) by server.rulingia.com (8.15.2/8.15.2) with ESMTPS id u4GK5iid028001 (version=TLSv1.2 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO); Tue, 17 May 2016 06:05:44 +1000 (AEST) (envelope-from peter@server.rulingia.com)
Received: (from peter@localhost) by server.rulingia.com (8.15.2/8.15.2/Submit) id u4GK5h0w028000; Tue, 17 May 2016 06:05:43 +1000 (AEST) (envelope-from peter)
Date: Tue, 17 May 2016 06:05:43 +1000
From: Peter Jeremy
To: Willem Jan Withagen
Cc: "freebsd-fs@FreeBSD.org"
Subject: Re: Bigger MAX_PATH (Was: Re: State of native encryption in ZFS)
Message-ID: <20160516200543.GC42426@server.rulingia.com>
References: <5736E7B4.1000409@gmail.com> <57378707.19425.B54772B@s_sourceforge.nedprod.com> <57385356.4525.E728971@s_sourceforge.nedprod.com>
<9ead4b28-9711-5e38-483f-ef9eaf0bc583@digiware.nl>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha512; protocol="application/pgp-signature"; boundary="8t9RHnE3ZwKMSgU+"
Content-Disposition: inline
In-Reply-To: <9ead4b28-9711-5e38-483f-ef9eaf0bc583@digiware.nl>
X-PGP-Key: http://www.rulingia.com/keys/peter.pgp
User-Agent: Mutt/1.6.1 (2016-04-27)
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: Filesystems
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Mon, 16 May 2016 20:06:15 -0000

--8t9RHnE3ZwKMSgU+
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On 2016-May-16 15:18:17 +0200, Willem Jan Withagen wrote:
>Trying to port Ceph is also running into the limit in:
>/usr/include/sys/syslimits.h:
>#define NAME_MAX 255 /* max bytes in a file name */
>
>but I also found:
>/usr/include/stdio.h:
>#define FILENAME_MAX 1024 /* must be <= PATH_MAX */
>
>So take a pick??

There are two distinct limits: The maximum number of characters in a
pathname component (ie the name seen in a directory entry): For UFS,
this is 255 because the length is stored on disk in a uint8_t (I don't
know the limit for ZFS). The other limit is the maximum number of
characters in a pathname - PATH_MAX. This is used to dimension various
buffers but isn't persistent on disk so you should be able to increase
it by changing the relevant #defines and rebuilding everything.

--
Peter Jeremy

--8t9RHnE3ZwKMSgU+
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iQJ8BAEBCgBmBQJXOigXXxSAAAAAAC4AKGlzc3Vlci1mcHJAbm90YXRpb25zLm9w
ZW5wZ3AuZmlmdGhob3JzZW1hbi5uZXRFRUIyOTg2QzMwNjcxRTc0RTY1QzIyN0Ux
NkE1OTdBMEU0QTIwQjM0AAoJEBall6Dkogs0JXsP/3kcHzh+YSnEjcbMb2eY7qSZ
0U5XU/DiK9ko2VpKELqw+y3cQoeyu0YT7IlvIVWqSDtb4NdJ5o/AcAtP7JO6c4ot
JyMwOu1VvyFm9ZZ7cR9AGJ7GH0/YtcXBYTlXkrHqwi1vg18AhL0kFH+VD3uAQn/o
9bkLvJKxGaf5MSQyBoHY4jjBCHU2wN3+nu/ZS7ZZMJ27qYEyX1CCpqSoV4wpJIFC
1JUGz4lhhk+J1qdqN94AbnoD3iYos1HBIiFo8gzVCEngnzfFhSE9DIbTRH7HUQit
EBmpi8fb3gCeOLTj1qmc0qE5MGLz2Y4m/GWoqMgkPpHq+957LYIihUEktfuviHdC
6xKDVuQBFqv3lrt1DaboRmobnEVBephKlTgpNoYM2z/n8oEgkEukQUGui+pArqFK
RjN+pnLOzMyUoK1I39eRR1WN120KV7RdOEvIYdKZEZFKhtJ95yN4lxXseQrAAQ2C
SA0NXoNW3VU5ZZGur0m7yRj8YbxHxCdB4rqAX0ppoPngo5nrTXJHtGTp1N4lRbvl
Qzqk1Wq1CbejtY9i3VCSEWXK/d3tXnGGkDFu4Faq2rSoJwy2f7Mr2Kv88eq+dkwE
JJlB555ZhdkIlCy+Ypt8N5TSEgxwm7pXVjXbr87YTL+hrG5nhJGSXhUSX3ZJvz4R
6ALzojz8wNTEh527wvXR
=43s8
-----END PGP SIGNATURE-----

--8t9RHnE3ZwKMSgU+--

From owner-freebsd-fs@freebsd.org Mon May 16 20:19:17 2016
Return-Path:
Delivered-To: freebsd-fs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 35D4CB3D40A for ; Mon, 16 May 2016 20:19:17 +0000 (UTC) (envelope-from opticz7g__toypmoypru__r4@rayman.beget.ru)
Received: from mailman.ysv.freebsd.org (unknown [127.0.1.3]) by mx1.freebsd.org (Postfix) with ESMTP id 245A71363 for ; Mon, 16 May 2016 20:19:17 +0000 (UTC) (envelope-from opticz7g__toypmoypru__r4@rayman.beget.ru)
Received: by mailman.ysv.freebsd.org (Postfix) id 1FD96B3D409; Mon, 16 May 2016 20:19:17 +0000 (UTC)
Delivered-To: fs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 06F7FB3D405 for ; Mon, 16 May 2016 20:19:17 +0000 (UTC) (envelope-from
opticz7g__toypmoypru__r4@rayman.beget.ru) Received: from m2.rayman.beget.ru (m2.rayman.beget.ru [87.236.19.11]) (using TLSv1 with cipher AES256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 6CD001362 for ; Mon, 16 May 2016 20:19:16 +0000 (UTC) (envelope-from opticz7g__toypmoypru__r4@rayman.beget.ru) Received: from opticz7g (Authenticated sender opticz7g@rayman.beget.ru) by rayman.beget.ru with local (Exim 4.76) (envelope-from ) id 1b2Oyu-0007wC-Qe for fs@freebsd.org; Mon, 16 May 2016 23:19:12 +0300 To: fs@freebsd.org Subject: Notice of appearance in Court #00630654 Date: Mon, 16 May 2016 23:19:12 +0300 From: "County Court" Reply-To: "County Court" Message-ID: <69adcbc1da1b1e8dfee1445db35757b6@toy-moy.ru> X-Priority: 3 MIME-Version: 1.0 Precedence: bulk Content-Type: text/plain; charset=us-ascii X-Content-Filtered-By: Mailman/MimeDel 2.1.22 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 16 May 2016 20:19:17 -0000 Notice to Appear, You have to appear in the Court on the May 23. Please, prepare all the documents relating to the case and bring them to Court on the specified date. Note: The case may be heard by the judge in your absence if you do not come. You can review complete details of the Court Notice in the attachment. Kind regards, Rick Finley, Clerk of Court. From owner-freebsd-fs@freebsd.org Mon May 16 21:14:04 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id AED72B3EAED for ; Mon, 16 May 2016 21:14:04 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mailman.ysv.freebsd.org (mailman.ysv.freebsd.org [IPv6:2001:1900:2254:206a::50:5]) by mx1.freebsd.org (Postfix) with ESMTP id 9E3DD1F1B for ; Mon, 16 May 2016 21:14:04 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: by mailman.ysv.freebsd.org (Postfix) id 99E50B3EAEC; Mon, 16 May 2016 21:14:04 +0000 (UTC) Delivered-To: fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 9988CB3EAEB for ; Mon, 16 May 2016 21:14:04 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail110.syd.optusnet.com.au (mail110.syd.optusnet.com.au [211.29.132.97]) by mx1.freebsd.org (Postfix) with ESMTP id 2F6BF1F19; Mon, 16 May 2016 21:14:03 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from c122-106-149-109.carlnfd1.nsw.optusnet.com.au (c122-106-149-109.carlnfd1.nsw.optusnet.com.au [122.106.149.109]) by mail110.syd.optusnet.com.au (Postfix) with ESMTPS id 823C57837EC; Tue, 17 May 2016 07:13:56 +1000 (AEST) Date: Tue, 17 May 2016 07:13:55 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: fs@freebsd.org cc: rmacklem@freebsd.org Subject: fixes for i/o counting in nfs Message-ID: <20160517063058.E2021@besplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.1 cv=TuMb/2jh c=1 sm=1 tr=0 a=R/f3m204ZbWUO/0rwPSMPw==:117 a=L9H7d07YOLsA:10 a=9cW_t1CCXrUA:10 a=s5jvgZ67dGcA:10 a=kj9zAlcOel0A:10 a=ao2s4AhymvbEqQ6vx_0A:9 a=8qM5l07LXi6s6vM6:21 a=vbcHcmFXaA1j43Cz:21 a=CjuIK1q_8ugA:10 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems 
List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 16 May 2016 21:14:04 -0000 nfs doesn't count block inputs in resource usage. It seems to count block outputs well enough (not very well, since buffering and threading causes some i/o's to be done by other threads where the counts are hard to see and harder to associate with the actual user). nfs doesn't support per-mount i/o counts for either input and output. These patches are for an old version of oldnfs. They apply cleanly to oldnfs in FreeBSD-10. I'm not sure if I found all the i/o's and don't trust my thread and mount pointer handling, but they seem to work reasonably and never find a null pointer. The per-mount i/o counts were easier to do than for most file systems since nfs isn't handicapped by using geom. The corresponding code in g_vfs_strategy() has a harder time finding the mount point and often fails, so must check for null pointers and not work when it can't find the mount point. This code doesn't even exist in the version that this patch is for (except I patch it in). X Index: nfs_bio.c X =================================================================== X --- nfs_bio.c (revision 181737) X +++ nfs_bio.c (working copy) X @@ -1568,6 +1581,14 @@ X case VREG: X uiop->uio_offset = ((off_t)bp->b_blkno) * DEV_BSIZE; X nfsstats.read_bios++; X + if (td == NULL) X + curthread->td_ru.ru_inblock++; /* XXX */ X + else X + td->td_ru.ru_inblock++; /* XXX? */ These are XXX'ed since I don't know if td is ever null or always right when it is non-null. But this seems to work right -- some counts go to normal threads and some to nfsiod's. X + if (LK_HOLDER(bp->b_lock.lk_lock) == LK_KERNPROC) X + vp->v_mount->mnt_stat.f_asyncreads++; /* XXX */ X + else X + vp->v_mount->mnt_stat.f_syncreads++; X error = (nmp->nm_rpcops->nr_readrpc)(vp, uiop, cr); This is XXX'ed since I don't trust the LK_KERNPROC check at all. This was blindly copied from g_vfs_strategy(). A separate count for async _reads_ is not very useful anyway. It is mostly for read-ahead. Most reads should be ahead, but complicated buffering in hardware and software makes them hard to count and the counts not very useful. X X if (!error) { X @@ -1674,10 +1695,16 @@ X io.iov_base = (char *)bp->b_data + bp->b_dirtyoff; X uiop->uio_rw = UIO_WRITE; X nfsstats.write_bios++; X + if (td == NULL) X + curthread->td_ru.ru_oublock++; /* XXX */ X + else X + td->td_ru.ru_oublock++; /* XXX? */ As above. X X if ((bp->b_flags & (B_ASYNC | B_NEEDCOMMIT | B_NOCACHE | B_CLUSTER)) == B_ASYNC) X + vp->v_mount->mnt_stat.f_asyncwrites++, X iomode = NFSV3WRITE_UNSTABLE; X else X + vp->v_mount->mnt_stat.f_syncwrites++, X iomode = NFSV3WRITE_FILESYNC; Here the sync/async decision is easy to make correctly. The patch uses a comma splice hack to keep the patch small. X X error = (nmp->nm_rpcops->nr_writerpc)(vp, uiop, cr, &iomode, &must_commit); X Index: nfs_vnops.c X =================================================================== X --- nfs_vnops.c (revision 181737) X +++ nfs_vnops.c (working copy) X @@ -3138,7 +3290,6 @@ X bp->b_iocmd = BIO_WRITE; X X bufobj_wref(bp->b_bufobj); X - curthread->td_ru.ru_oublock++; X splx(s); X X /* This is now counted in nfs_bio.c, and the results are much the same. Apparently it makes little difference to always use curthread. 
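(A quick way to see the effect from userland: /usr/bin/time -l prints
the rusage fields that these patches update. Assuming /mnt is an nfs
mount:

    /usr/bin/time -l dd if=/mnt/some-large-file of=/dev/null bs=64k

"block input operations" should then be nonzero for reads that actually
go to the server, instead of always 0 as before.)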
In file systems generally, we pass around td's and they are usually useless, but if they are good for anything at all then it is to record the (first) originator of the i/o so as to charge the originator and not a daemon. I don't know if they are used for that. Bruce From owner-freebsd-fs@freebsd.org Mon May 16 21:26:19 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 6AF39B3EF13 for ; Mon, 16 May 2016 21:26:19 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mailman.ysv.freebsd.org (unknown [127.0.1.3]) by mx1.freebsd.org (Postfix) with ESMTP id 5AE8516F0 for ; Mon, 16 May 2016 21:26:19 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: by mailman.ysv.freebsd.org (Postfix) id 56973B3EF12; Mon, 16 May 2016 21:26:19 +0000 (UTC) Delivered-To: fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 563D4B3EF11 for ; Mon, 16 May 2016 21:26:19 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail109.syd.optusnet.com.au (mail109.syd.optusnet.com.au [211.29.132.80]) by mx1.freebsd.org (Postfix) with ESMTP id 2500716EF for ; Mon, 16 May 2016 21:26:18 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from c122-106-149-109.carlnfd1.nsw.optusnet.com.au (c122-106-149-109.carlnfd1.nsw.optusnet.com.au [122.106.149.109]) by mail109.syd.optusnet.com.au (Postfix) with ESMTPS id 338E2D6691C for ; Tue, 17 May 2016 07:26:09 +1000 (AEST) Date: Tue, 17 May 2016 07:26:08 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: fs@freebsd.org Subject: fix for per-mount i/o counting in ffs Message-ID: <20160517072104.I2137@besplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.1 cv=EfU1O6SC c=1 sm=1 tr=0 a=R/f3m204ZbWUO/0rwPSMPw==:117 a=L9H7d07YOLsA:10 a=9cW_t1CCXrUA:10 a=s5jvgZ67dGcA:10 a=kj9zAlcOel0A:10 a=5YkQZLojSFcQydPC5FAA:9 a=CjuIK1q_8ugA:10 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 16 May 2016 21:26:19 -0000 Counting of i/o's in g_vfs_strategy() requires the fs to initialize devvp->v_rdev->si_mountpt to non-null. This seems to be done correctly in ext2fs and msdosfs, but in ffs it is not done for ro mounts, or for rw mounts that started as ro. The bug is most obvious for the root file system since it always starts as ro. The patch fixes 2 unrelated style bugs in comments. X Index: ffs_vfsops.c X =================================================================== X --- ffs_vfsops.c (revision 299263) X +++ ffs_vfsops.c (working copy) X @@ -512,7 +512,7 @@ X * We need the name for the mount point (also used for X * "last mounted on") copied in. If an error occurs, X * the mount point is discarded by the upper level code. X - * Note that vfs_mount() populates f_mntonname for us. X + * Note that vfs_mount_alloc() populates f_mntonname for us. 
X */ X if ((error = ffs_mountfs(devvp, mp, td)) != 0) { X vrele(devvp); X @@ -1049,8 +1049,6 @@ X ffs_flushfiles(mp, FORCECLOSE, td); X goto out; X } X - if (devvp->v_type == VCHR && devvp->v_rdev != NULL) X - devvp->v_rdev->si_mountpt = mp; X if (fs->fs_snapinum[0] != 0) X ffs_snapshot_mount(mp); X fs->fs_fmod = 1; X @@ -1057,8 +1055,10 @@ X fs->fs_clean = 0; X (void) ffs_sbupdate(ump, MNT_WAIT, 0); X } X + if (devvp->v_type == VCHR && devvp->v_rdev != NULL) X + devvp->v_rdev->si_mountpt = mp; X /* X - * Initialize filesystem stat information in mount struct. X + * Initialize filesystem state information in mount struct. X */ X MNT_ILOCK(mp); X mp->mnt_kern_flag |= MNTK_LOOKUP_SHARED | MNTK_EXTENDED_SHARED | Bruce From owner-freebsd-fs@freebsd.org Mon May 16 21:54:37 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id D47D5B387D4 for ; Mon, 16 May 2016 21:54:37 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mailman.ysv.freebsd.org (unknown [127.0.1.3]) by mx1.freebsd.org (Postfix) with ESMTP id C46E518E0 for ; Mon, 16 May 2016 21:54:37 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: by mailman.ysv.freebsd.org (Postfix) id C008EB387D3; Mon, 16 May 2016 21:54:37 +0000 (UTC) Delivered-To: fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id BFAECB387D2 for ; Mon, 16 May 2016 21:54:37 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail104.syd.optusnet.com.au (mail104.syd.optusnet.com.au [211.29.132.246]) by mx1.freebsd.org (Postfix) with ESMTP id 8D6FA18DF for ; Mon, 16 May 2016 21:54:37 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from c122-106-149-109.carlnfd1.nsw.optusnet.com.au (c122-106-149-109.carlnfd1.nsw.optusnet.com.au [122.106.149.109]) by mail104.syd.optusnet.com.au (Postfix) with ESMTPS id E24C64271DC for ; Tue, 17 May 2016 07:54:28 +1000 (AEST) Date: Tue, 17 May 2016 07:54:27 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: fs@freebsd.org Subject: quick fix for slow directory shrinking in ffs Message-ID: <20160517072705.F2157@besplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.1 cv=TuMb/2jh c=1 sm=1 tr=0 a=R/f3m204ZbWUO/0rwPSMPw==:117 a=L9H7d07YOLsA:10 a=9cW_t1CCXrUA:10 a=s5jvgZ67dGcA:10 a=kj9zAlcOel0A:10 a=pubc52WGR5en7ZIXB40A:9 a=CjuIK1q_8ugA:10 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 16 May 2016 21:54:37 -0000 ffs does very slow shrinking of directories after removing some files leaves unused blocks at the end, by always doing synchronous truncation. This often happens in my normal usage: medium size builds expand /tmp from 512 to 1024 to hold a few more hundred bytes of file names; expansion is async and fast, but shrinking is sync and slow, and with a certain size of build the boundary is crossed back and forth very often. My /tmp directory is always on an async-mounted file system, so this quick fix of always doing an async truncation for async mounts works for me. Using IO_SYNC when not asked to is a bug for async mounts in all cases anyway. 
The file system has block size 8192 and frag size 1024, so it is also wrong to shrink to size DIRBLKSIZE = 512. The shrinkage seems to be considered at every DIRBLKSIZE boundary, so not only small directories are affected. The patch fixes an unrelated typo in a message. X Index: ufs_lookup.c X =================================================================== X --- ufs_lookup.c (revision 299263) X +++ ufs_lookup.c (working copy) X @@ -1131,9 +1131,9 @@ X if (tvp != NULL) X VOP_UNLOCK(tvp, 0); X error = UFS_TRUNCATE(dvp, (off_t)dp->i_endoff, X - IO_NORMAL | IO_SYNC, cr); X + IO_NORMAL | (DOINGASYNC(dvp) ? 0 : IO_SYNC), cr); X if (error != 0) X - vprint("ufs_direnter: failted to truncate", dvp); X + vprint("ufs_direnter: failed to truncate", dvp); X #ifdef UFS_DIRHASH X if (error == 0 && dp->i_dirhash != NULL) X ufsdirhash_dirtrunc(dp, dp->i_endoff); Bruce From owner-freebsd-fs@freebsd.org Mon May 16 22:36:54 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id CA742B3D6FC for ; Mon, 16 May 2016 22:36:54 +0000 (UTC) (envelope-from girgen@FreeBSD.org) Received: from mail.pingpong.net (mail.pingpong.net [79.136.116.202]) by mx1.freebsd.org (Postfix) with ESMTP id 92F0014E6 for ; Mon, 16 May 2016 22:36:54 +0000 (UTC) (envelope-from girgen@FreeBSD.org) Received: from [10.0.1.11] (h-155-4-128-242.na.cust.bahnhof.se [155.4.128.242]) (using TLSv1 with cipher ECDHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.pingpong.net (Postfix) with ESMTPSA id 3120F16B2C; Tue, 17 May 2016 00:36:52 +0200 (CEST) Subject: Re: Best practice for high availability ZFS pool Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\)) Content-Type: multipart/signed; boundary="Apple-Mail=_E981C4D9-6449-4B12-B476-356B0F43A9DD"; protocol="application/pgp-signature"; micalg=pgp-sha256 X-Pgp-Agent: GPGMail 2.6b2 From: Palle Girgensohn In-Reply-To: Date: Tue, 17 May 2016 00:36:51 +0200 Cc: freebsd-fs@freebsd.org Message-Id: <89D73122-FAC7-4449-AAB3-C4BBE74B960A@FreeBSD.org> References: <5E69742D-D2E0-437F-B4A9-A71508C370F9@FreeBSD.org> To: Borja Marcos X-Mailer: Apple Mail (2.3124) X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Mon, 16 May 2016 22:36:54 -0000 --Apple-Mail=_E981C4D9-6449-4B12-B476-356B0F43A9DD Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=utf-8 > 16 maj 2016 kl. 15:51 skrev Borja Marcos : >=20 >=20 >> On 16 May 2016, at 12:08, Palle Girgensohn = wrote: >>=20 >> Hi, >>=20 >> We need to set up a ZFS pool with redundance. The main goal is high = availability - uptime. >>=20 >> I can see a few of paths to follow. >>=20 >> 1. HAST + ZFS >=20 > Which means that a possible corruption causing bug in ZFS would = vaporize the data of both replicas. >=20 >> 3. ZFS replication (zfs snapshot + zfs send | ssh | zfs receive) >=20 > If you don=E2=80=99t have a hard requirement for synchronous = replication (and, in that case, I would opt for a more application > aware approach) it=E2=80=99s the best method in my opinion. That was exactly my thought 18 months ago, and we set up two systems = with zfs snapshot + zfs send | ssh | zfs receive. It works, but the = problem is it just too slow and a complete sync takes like 10 minutes = for all the file systems. 
We are forced to sync the file systems one at a time to get the kind of control and separation we need. Even if we could speed that up somehow, we are really looking for a more resilient system. Also, constant snapshotting and writing makes scrub very slow, so we need to tune down the amount of syncing every fourth weekend to scrub. It's OK but not optimal, so we're pondering something better.

My first choice is really HAST at the moment, but I also don't find much written in the last couple of years, apart from some articles about setting it up in very minimal testbeds or posts about performance and stability troubles. This makes me wonder, is HAST actively maintained? Is it stable, used and loved by the community? I'd love to hear some success stories with fairly large installations of at least 20 TB or so.

Palle

--Apple-Mail=_E981C4D9-6449-4B12-B476-356B0F43A9DD
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename=signature.asc
Content-Type: application/pgp-signature; name=signature.asc
Content-Description: Message signed with OpenPGP using GPGMail

-----BEGIN PGP SIGNATURE-----
Comment: GPGTools - http://gpgtools.org

iQEcBAEBCAAGBQJXOkuDAAoJEDQn0sf36Uls++IIAIGX1yPZt2BdPB9rly71u+TV
9jap9c0ZtUagYcwUNnUbKuShoEKr1FCyIv5trIB13CC7UieBV3f8AAprCa7fohb3
Hc5nENqjyqaG2udppYg7J5mXs1so5W6F9SdmSuIh2RSCvtV+aKm5ofmF+Ef7ZiEo
zvR8jJzVcLEHm5RnpzQm1oU17U0eHwfF5fdWtaw69roHCWMk08MkQcJBocXORAh5
/+L7zzPxezQh4YeYfDnj9rC7vaerU8iyEQsw8MV6tY6gD+JiW1dfjZK6p0AwwkKk
W876vHi+rbxpWt4bLYDBPbRsnRGYaL9AuX1bGSgAvXlhZS2Rod5DdnpoX5ez/+E=
=kVC+
-----END PGP SIGNATURE-----

--Apple-Mail=_E981C4D9-6449-4B12-B476-356B0F43A9DD--

From owner-freebsd-fs@freebsd.org Mon May 16 22:44:19 2016
Return-Path:
Delivered-To: freebsd-fs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id AFF7DB3D9FC for ; Mon, 16 May 2016 22:44:19 +0000 (UTC) (envelope-from girgen@FreeBSD.org)
Received: from mail.pingpong.net (mail.pingpong.net [79.136.116.202]) by mx1.freebsd.org (Postfix) with ESMTP id 76AF41D37 for ; Mon, 16 May 2016 22:44:19 +0000 (UTC) (envelope-from girgen@FreeBSD.org)
Received: from [10.0.1.11] (h-155-4-128-242.na.cust.bahnhof.se [155.4.128.242]) (using TLSv1 with cipher ECDHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.pingpong.net (Postfix) with ESMTPSA id 6737716B4D; Tue, 17 May 2016 00:44:18 +0200 (CEST)
Subject: Re: Best practice for high availability ZFS pool
Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\))
Content-Type: multipart/signed; boundary="Apple-Mail=_7E2DA032-0BE8-495C-95AE-5A80E8AB857A"; protocol="application/pgp-signature"; micalg=pgp-sha256
X-Pgp-Agent: GPGMail 2.6b2
From: Palle Girgensohn
In-Reply-To: <284D58D1-1C62-4519-A46B-7D0E8326B86B@ultra-secure.de>
Date: Tue, 17 May 2016 00:44:18 +0200
Cc: freebsd-fs@freebsd.org
Message-Id:
References: <5E69742D-D2E0-437F-B4A9-A71508C370F9@FreeBSD.org> <284D58D1-1C62-4519-A46B-7D0E8326B86B@ultra-secure.de>
To: Rainer Duffner
X-Mailer: Apple Mail (2.3124)
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: Filesystems
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Mon, 16 May 2016 22:44:19 -0000

--Apple-Mail=_7E2DA032-0BE8-495C-95AE-5A80E8AB857A
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset=utf-8
> On 16 May 2016 at 16:52, Rainer Duffner wrote:
>
>
>> On 16.05.2016 at 12:08, Palle Girgensohn wrote:
>>
>> Hi,
>>
>> We need to set up a ZFS pool with redundancy. The main goal is high availability - uptime.
>>
>> I can see a few paths to follow.
>>
>> 1. HAST + ZFS
>>
>> 2. Some sort of shared storage, two machines sharing a JBOD box.
>>
>> 3. ZFS replication (zfs snapshot + zfs send | ssh | zfs receive)
>>
>> 4. Using something other than ZFS, even a different OS if required.
>
>
>
> There’s always GlusterFS.
> Recently ported to FreeBSD and available as net/glusterfs (10.3 recommended, AFAIK).
>
> At work, we use it on Ubuntu - but not with so much data.
> On Linux, I’d use it on top of XFS.
>
> For our Cloud-Storage, we went with ScaleIO (which is Linux only).
>
> You need more than two nodes with Gluster, though (for production use);
> I think my co-worker said at least four.

Yeah, it is interesting, but as you say, you really create a RAID5 setup at least.

>
> If you have the money and don’t mind Linux, ScaleIO is probably the best you can buy at the moment.
> While licensed at the GByte level (yeah, EMC…) it can be used free of charge, unsupported.

Yeah, that is definitely an option.

We already have an infrastructure based on ZFS, and I am not sure I trust ZFS on Linux?

Palle

--Apple-Mail=_7E2DA032-0BE8-495C-95AE-5A80E8AB857A
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename=signature.asc
Content-Type: application/pgp-signature; name=signature.asc
Content-Description: Message signed with OpenPGP using GPGMail

-----BEGIN PGP SIGNATURE-----
Comment: GPGTools - http://gpgtools.org

iQEcBAEBCAAGBQJXOk1CAAoJEDQn0sf36UlsT6MIALIVeD599KCAcpNS0ogBGxK3
a9SOuA/eUUTrsuMqrbvxdBWlvzmOF3IkalgjRpVzuCrup2Ukaq7qxpMPmxVBXylM
dNDiLi6aVU++vlfBbnTJRBrY8HNG2ZhCKBd+r83gCyo6SAPOoYtHEUjLZC/OYhVv
MmBCarS41VD1c/VvildV0inJtLwPeK/ltQb4V39DBuMGKoDYq//cPJqzw4PoPns1
i+M8JS7r70AAanh6QO73gUHr3cTvztwVVNPgjROWispmboZ/Hh6im+dyWwmegDSW
JMV8nlUPG2urq1HukcY1poV3OdY/sWVuO8X8t4F1thEOCeLF5wsf8aL1PhGaXCA=
=0v7N
-----END PGP SIGNATURE-----

--Apple-Mail=_7E2DA032-0BE8-495C-95AE-5A80E8AB857A--

From owner-freebsd-fs@freebsd.org Mon May 16 22:47:13 2016
Return-Path:
Delivered-To: freebsd-fs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 9328EB3DAD8 for ; Mon, 16 May 2016 22:47:13 +0000 (UTC) (envelope-from rainer@ultra-secure.de)
Received: from connect.ultra-secure.de (connect.ultra-secure.de [88.198.71.201]) by mx1.freebsd.org (Postfix) with ESMTP id C32CD1F6B; Mon, 16 May 2016 22:47:12 +0000 (UTC) (envelope-from rainer@ultra-secure.de)
Received: (Haraka outbound); Tue, 17 May 2016 00:47:10 +0200
Authentication-Results: connect.ultra-secure.de; iprev=pass; auth=pass (plain); spf=none smtp.mailfrom=ultra-secure.de
Received-SPF: None (connect.ultra-secure.de: domain of ultra-secure.de does not designate 217.71.83.52 as permitted sender) receiver=connect.ultra-secure.de; identity=mailfrom; client-ip=217.71.83.52; helo=[192.168.1.200]; envelope-from=
Received: from [192.168.1.200] (217-071-083-052.ip-tech.ch [217.71.83.52]) by connect.ultra-secure.de (Haraka/2.6.2-toaster) with ESMTPSA id D4AC038E-61DD-4A34-A05C-6796C46862BF.1 envelope-from (authenticated bits=0) (version=TLSv1/SSLv3 cipher=AES256-SHA verify=NO); Tue, 17 May 2016 00:47:08 +0200
Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\))
Subject: Re: Best practice for high availability ZFS pool
From: Rainer Duffner
In-Reply-To:
Date: Tue, 17 May 2016 00:47:07 +0200
Cc: freebsd-fs@freebsd.org
Message-Id:
References: <5E69742D-D2E0-437F-B4A9-A71508C370F9@FreeBSD.org> <284D58D1-1C62-4519-A46B-7D0E8326B86B@ultra-secure.de>
To: Palle Girgensohn
X-Mailer: Apple Mail (2.3124)
X-Haraka-GeoIP: EU, CH, 451km
X-Haraka-ASN: 24951
X-Haraka-GeoIP-Received:
X-Haraka-ASN: 24951 217.71.80.0/20
X-Haraka-ASN-CYMRU: asn=24951 net=217.71.80.0/20 country=CH assignor=ripencc date=2003-08-07
X-Haraka-FCrDNS: 217-071-083-052.ip-tech.ch
X-Haraka-p0f: os="Mac OS X " link_type="DSL" distance=13 total_conn=1 shared_ip=N
X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on spamassassin
X-Spam-Level:
X-Spam-Status: No, score=-2.9 required=5.0 tests=ALL_TRUSTED,BAYES_00, HTML_MESSAGE autolearn=ham autolearn_force=no version=3.4.1
X-Haraka-Karma: score: 6, good: 167, bad: 0, connections: 327, history: 167, asn_score: 101, asn_connections: 112, asn_good: 101, asn_bad: 0, pass:all_good, asn, asn_all_good, relaying
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-Content-Filtered-By: Mailman/MimeDel 2.1.22
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: Filesystems
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Mon, 16 May 2016 22:47:13 -0000

> On 17.05.2016 at 00:44, Palle Girgensohn wrote:
>
>>
>
> We already have an infrastructure based on ZFS, and I am not sure I trust ZFS on Linux?

Wouldn’t start with a 20T pool on that one, TBH ;-)

There are probably a lot of quirks and workarounds needed that only those who’ve run it for a long time are aware of (if they’re actually aware of them at all).

That said, I’ve run into my own problems with zfs send now… but only on 10.3.

Rainer

From owner-freebsd-fs@freebsd.org Mon May 16 22:50:08 2016
Return-Path:
Delivered-To: freebsd-fs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id D956AB3DBC4 for ; Mon, 16 May 2016 22:50:08 +0000 (UTC) (envelope-from girgen@pingpong.net)
Received: from mail.pingpong.net (mail.pingpong.net [79.136.116.202]) by mx1.freebsd.org (Postfix) with ESMTP id 6BF39109F; Mon, 16 May 2016 22:50:07 +0000 (UTC) (envelope-from girgen@pingpong.net)
Received: from [10.0.1.14] (h-155-4-128-242.na.cust.bahnhof.se [155.4.128.242]) (using TLSv1 with cipher ECDHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.pingpong.net (Postfix) with ESMTPSA id 15F0816B6F; Tue, 17 May 2016 00:50:07 +0200 (CEST)
Mime-Version: 1.0 (1.0)
Subject: Re: Best practice for high availability ZFS pool
From: Palle Girgensohn
X-Mailer: iPhone Mail (13E238)
In-Reply-To:
Date: Tue, 17 May 2016 00:50:06 +0200
Cc: Palle Girgensohn , freebsd-fs@freebsd.org
Message-Id: <726D88E6-A1DF-4E5A-ACFF-8A11E6EB3916@pingpong.net>
References: <5E69742D-D2E0-437F-B4A9-A71508C370F9@FreeBSD.org> <284D58D1-1C62-4519-A46B-7D0E8326B86B@ultra-secure.de>
To: Rainer Duffner
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
X-Content-Filtered-By: Mailman/MimeDel 2.1.22
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: Filesystems
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Mon, 16 May 2016 22:50:08 -0000

> On 17 May 2016 at 00:47, Rainer Duffner wrote:
>
>
>>> On 17.05.2016 at 00:44, Palle Girgensohn wrote:
>>>
>>
>> We already have an infrastructure based on ZFS, and I am not sure I trust ZFS on Linux?
>
>
>
>
> Wouldn’t start with a 20T pool on that one, TBH ;-)
>
> There are probably a lot of quirks and workarounds needed that only those who’ve run it for a long time are aware of (if they’re actually aware of them at all).
>
>
> That said, I’ve run into my own problems with zfs send now… but only on 10.3.
>

We are still 10.2. Are there, in your opinion, regressions in 10.3 for zfs send?

From owner-freebsd-fs@freebsd.org Mon May 16 23:02:35 2016
Return-Path:
Delivered-To: freebsd-fs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 8EED1B3E04B for ; Mon, 16 May 2016 23:02:35 +0000 (UTC) (envelope-from nonesuch@longcount.org)
Received: from mail-io0-x231.google.com (mail-io0-x231.google.com [IPv6:2607:f8b0:4001:c06::231]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 5247A1EE3 for ; Mon, 16 May 2016 23:02:35 +0000 (UTC) (envelope-from nonesuch@longcount.org)
Received: by mail-io0-x231.google.com with SMTP id f89so2240066ioi.0 for ; Mon, 16 May 2016 16:02:35 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=longcount-org.20150623.gappssmtp.com; s=20150623; h=mime-version:subject:from:in-reply-to:date:cc :content-transfer-encoding:message-id:references:to; bh=qXvAPASJ80O1BKmGkDkUgKc4WpmA27h4JISfn/yv4J8=; b=LoVMnaqMDjfHLShd1TI0C/cNkD3RlfyclnCzIppSm2DR84bKibuPtEord088D2Bl+4 WE+mHvNsMqWLQqfCvz5tMjIgkFPBge4ZBPAecCLJ+IewqL3RMu59/6NrFWDPDhVhT7bp gHqK+QC+9+6tAp1wFVfGBWLtVsy7bcr9tjp3qfRYX2zDpQku6D3JaHh4hBF+YUz2Mz9J YatnbHCRqHvmtEcRbDJO8gnYgyJxpXzPgxt6ZC3P2HeTO8i7CCep67m8LeEklmCPx7PX 6Ed0w9XLlIqGQr6EJJGtm9CoBJ8Di08KnMfITKgaE2K1sLTna8WpB147elFsiSD2wC+h PFCg==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:subject:from:in-reply-to:date:cc :content-transfer-encoding:message-id:references:to; bh=qXvAPASJ80O1BKmGkDkUgKc4WpmA27h4JISfn/yv4J8=; b=gPhyAagvN+v6ux1lFxu11yTgSzFrjWfpMWQASSmjH39QGosJHn+IecSSpVStQB35Qg XGYAq7GNW3nhykjbZlHVqg1iAkAPt1Vt9IncjdN/ISl+hgzEOK5tTphhjRYorhRERkNh YtGS+qPVrFdGnwu8+uzdyJMiF03jC0v0Em5TOCWkuxC/KKaRtRLgfKGfLrVv1OUuT68J 51S1xpr/LZsjkpTKsmeIkgK8L14dEIShVq5zrt8WZkUntzFjEbTFUmdS9juB2KbY2xBo eeHLOLi3J6+GSA/MhEwVHC0cJJar/tRjkqRkwh1Q8kjfKNBLSbiGp/coDVEj3dXgOmh0 4D7A==
X-Gm-Message-State: AOPr4FX1F97w9iGv7lZG+aZzCeVRRnrcAZSFiWLxhq9KUesA9eQmnUshMMiLCG2Hu3IX9g==
X-Received: by 10.107.10.208 with SMTP id 77mr22206726iok.51.1463439754659; Mon, 16 May 2016 16:02:34 -0700 (PDT)
Received: from [100.85.18.225] (153.sub-70-214-103.myvzw.com.
[70.214.103.153]) by smtp.gmail.com with ESMTPSA id uh3sm115304igb.3.2016.05.16.16.02.33 (version=TLSv1/SSLv3 cipher=OTHER); Mon, 16 May 2016 16:02:33 -0700 (PDT)
Content-Type: text/plain; charset=utf-8
Mime-Version: 1.0 (1.0)
Subject: Re: Best practice for high availability ZFS pool
From: Mark Saad
X-Mailer: iPhone Mail (13E238)
In-Reply-To: <726D88E6-A1DF-4E5A-ACFF-8A11E6EB3916@pingpong.net>
Date: Mon, 16 May 2016 19:02:32 -0400
Cc: Rainer Duffner , freebsd-fs@freebsd.org, Palle Girgensohn
Content-Transfer-Encoding: quoted-printable
Message-Id: <2135669E-BA6F-4D1F-B865-33D40E74CF51@longcount.org>
References: <5E69742D-D2E0-437F-B4A9-A71508C370F9@FreeBSD.org> <284D58D1-1C62-4519-A46B-7D0E8326B86B@ultra-secure.de> <726D88E6-A1DF-4E5A-ACFF-8A11E6EB3916@pingpong.net>
To: Palle Girgensohn
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: Filesystems
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Mon, 16 May 2016 23:02:35 -0000

> On May 16, 2016, at 6:50 PM, Palle Girgensohn wrote:
>
>
>> On 17 May 2016 at 00:47, Rainer Duffner wrote:
>>
>>
>>>> On 17.05.2016 at 00:44, Palle Girgensohn wrote:
>>>
>>> We already have an infrastructure based on ZFS, and I am not sure I trust ZFS on Linux?
>>
>>
>>
>>
>> Wouldn’t start with a 20T pool on that one, TBH ;-)
>>
>> There are probably a lot of quirks and workarounds needed that only those who’ve run it for a long time are aware of (if they’re actually aware of them at all).
>>
>>
>> That said, I’ve run into my own problems with zfs send now… but only on 10.3.
>
> We are still 10.2. Are there, in your opinion, regressions in 10.3 for zfs send?

Hi Palle
  Two questions: how is your zpool set up? Are you using a dedicated slog and/or l2arc? What level of ZFS raid are you using?

At work we use leofs on top of zfs.
It works well and has good replication and speed, but it's an S3 work-alike, not a general-purpose FS.

---
Mark Saad | nonesuch@longcount.org

> _______________________________________________
> freebsd-fs@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-fs
> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"

From owner-freebsd-fs@freebsd.org Mon May 16 23:06:37 2016
Return-Path: 
Delivered-To: freebsd-fs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 48C8BB3E1A9 for ; Mon, 16 May 2016 23:06:37 +0000 (UTC) (envelope-from 000.fbsd@quip.cz)
Received: from elsa.codelab.cz (elsa.codelab.cz [94.124.105.4]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 0DEBB1111; Mon, 16 May 2016 23:06:36 +0000 (UTC) (envelope-from 000.fbsd@quip.cz)
Received: from elsa.codelab.cz (localhost [127.0.0.1]) by elsa.codelab.cz (Postfix) with ESMTP id 2E8CC28426; Tue, 17 May 2016 01:00:40 +0200 (CEST)
Received: from illbsd.quip.test (ip-86-49-16-209.net.upcbroadband.cz [86.49.16.209]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by elsa.codelab.cz (Postfix) with ESMTPSA id 29EC528412; Tue, 17 May 2016 01:00:39 +0200 (CEST)
Message-ID: <573A5116.3090302@quip.cz>
Date: Tue, 17 May 2016 01:00:38 +0200
From: Miroslav Lachman <000.fbsd@quip.cz>
User-Agent: Mozilla/5.0 (X11; FreeBSD amd64; rv:35.0) Gecko/20100101 Firefox/35.0 SeaMonkey/2.32
MIME-Version: 1.0
To: Palle Girgensohn , Borja Marcos 
CC: freebsd-fs@freebsd.org
Subject: Re: Best practice for high availability ZFS pool
References: <5E69742D-D2E0-437F-B4A9-A71508C370F9@FreeBSD.org> <89D73122-FAC7-4449-AAB3-C4BBE74B960A@FreeBSD.org>
In-Reply-To: <89D73122-FAC7-4449-AAB3-C4BBE74B960A@FreeBSD.org>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: Filesystems 
List-Unsubscribe: , 
List-Archive: 
List-Post: 
List-Help: 
List-Subscribe: , 
X-List-Received-Date: Mon, 16 May 2016 23:06:37 -0000

Palle Girgensohn wrote on 05/17/2016 00:36:
>
>> 16 maj 2016 kl. 15:51 skrev Borja Marcos :
>>
>>
>>> On 16 May 2016, at 12:08, Palle Girgensohn wrote:
>>>
>>> Hi,
>>>
>>> We need to set up a ZFS pool with redundancy. The main goal is high availability - uptime.
>>>
>>> I can see a few paths to follow.
>>>
>>> 1. HAST + ZFS
>>
>> Which means that a possible corruption-causing bug in ZFS would vaporize the data of both replicas.
>>
>>> 3. ZFS replication (zfs snapshot + zfs send | ssh | zfs receive)
>>
>> If you don’t have a hard requirement for synchronous replication (and, in that case, I would opt for a more application-aware approach) it’s the best method in my opinion.
>
> That was exactly my thought 18 months ago, and we set up two systems with zfs snapshot + zfs send | ssh | zfs receive. It works, but the problem is it is just too slow, and a complete sync takes like 10 minutes for all the file systems. We are forced to sync the file systems one at a time to get the kind of control and separation we need. Even if we could speed that up somehow, we are really looking for a more resilient system. Also, constant snapshotting and writing makes scrub very slow, so we need to tune down the amount of syncing every fourth weekend to scrub.
> It's OK but not optimal, so we're pondering something better.
>
> My first choice is really HAST at the moment, but I also don't find much written about it in the last couple of years, apart from some articles about setting it up in very minimal testbeds, or posts about performance and stability troubles. This makes me wonder: is HAST actively maintained? Is it stable, used and loved by the community? I'd love to hear some success stories with fairly large installations of at least 20 TB or so.

I am not using HAST personally, but I read about success with HAST and ZFS somewhere in the FreeBSD mailing lists. I don't have a direct link / bookmark for it. Maybe you will find it through a search engine.

Miroslav Lachman

From owner-freebsd-fs@freebsd.org Mon May 16 23:07:30 2016
Return-Path: 
Delivered-To: freebsd-fs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 8A86CB3E23B for ; Mon, 16 May 2016 23:07:30 +0000 (UTC) (envelope-from rainer@ultra-secure.de)
Received: from connect.ultra-secure.de (connect.ultra-secure.de [88.198.71.201]) by mx1.freebsd.org (Postfix) with ESMTP id E79E511DE for ; Mon, 16 May 2016 23:07:29 +0000 (UTC) (envelope-from rainer@ultra-secure.de)
Received: (Haraka outbound); Tue, 17 May 2016 01:07:28 +0200
Authentication-Results: connect.ultra-secure.de; iprev=pass; auth=pass (plain); spf=none smtp.mailfrom=ultra-secure.de
Received-SPF: None (connect.ultra-secure.de: domain of ultra-secure.de does not designate 217.71.83.52 as permitted sender) receiver=connect.ultra-secure.de; identity=mailfrom; client-ip=217.71.83.52; helo=[192.168.1.200]; envelope-from=
Received: from [192.168.1.200] (217-071-083-052.ip-tech.ch [217.71.83.52]) by connect.ultra-secure.de (Haraka/2.6.2-toaster) with ESMTPSA id D0846A73-60AD-4F3A-841F-6946D77246BB.1 envelope-from (authenticated bits=0) (version=TLSv1/SSLv3 cipher=AES256-SHA verify=NO); Tue, 17 May 2016 01:07:26 +0200
From: Rainer Duffner 
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
Subject: zfs receive stalls whole system
Message-Id: <0C2233A9-C64A-4773-ABA5-C0BCA0D037F0@ultra-secure.de>
Date: Tue, 17 May 2016 01:07:24 +0200
To: FreeBSD Filesystems 
Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\))
X-Mailer: Apple Mail (2.3124)
X-Haraka-GeoIP: EU, CH, 451km
X-Haraka-ASN: 24951
X-Haraka-GeoIP-Received: 
X-Haraka-ASN: 24951 217.71.80.0/20
X-Haraka-ASN-CYMRU: asn=24951 net=217.71.80.0/20 country=CH assignor=ripencc date=2003-08-07
X-Haraka-FCrDNS: 217-071-083-052.ip-tech.ch
X-Haraka-p0f: os="Mac OS X " link_type="DSL" distance=13 total_conn=2 shared_ip=N
X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on spamassassin
X-Spam-Level: 
X-Spam-Status: No, score=-2.9 required=5.0 tests=ALL_TRUSTED,BAYES_00 autolearn=ham autolearn_force=no version=3.4.1
X-Haraka-Karma: score: 6, good: 168, bad: 0, connections: 328, history: 168, asn_score: 102, asn_connections: 113, asn_good: 102, asn_bad: 0, pass:all_good, asn, asn_all_good, relaying
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: Filesystems 
List-Unsubscribe: , 
List-Archive: 
List-Post: 
List-Help: 
List-Subscribe: , 
X-List-Received-Date: Mon, 16 May 2016 23:07:30 -0000

Hi,

I have two servers that were running FreeBSD 10.1-AMD64 for a long time, one zfs-sending to the other (via zxfer).
Both are NFS-servers and MySQL-slaves; the sender is actively used as NFS-server, the recipient is just a warm standby, in case something serious happens and we don’t want to wait for a day until the restore is back in place. The MySQL-slaves are actively used as read-only servers (at the application level; Python’s SQLAlchemy does that, apparently).

They are HP DL380G8 (one CPU, hexacore) with over 128 GB RAM (I think one has 144, the other has 192).
While they were running 10.1, they used HP P420 RAID-controllers with 12 individual RAID0 volumes that I pooled into 6-disk RAIDZ2 vdevs.
I use zfsnap to do hourly, daily and weekly snapshots.

Sending worked well, especially after updating to 10.1.

Because the storage was over 90% full (and I really hate this RAID0 business we have with the HP RAID controllers), I rebuilt the servers with HP’s OEMed H220/221 controllers (LSI 2308 in disguise), an external disk shelf hosting 12 additional disks was added, and I upgraded to FreeBSD 10.3.

Because we didn’t want to throw out the original disks, but increase available space a lot, the new disks are double the size of the original disks (600 vs. 1200 GB SAS).
I also created GPT partitions on the disks, labeled them according to the disk’s position in the cages/shelf, and created the pools with the GPT partition names instead of the daX names.

Now, when I do a zxfer, sometimes the whole system stalls while the data is sent over, especially if the delta is large or if something else is reading from the disk at the same time (backup agent).

I had this before, on 10.0 (I believe; we didn’t have this in 9.1 either, IIRC) and it went away in 10.1.

It’s very difficult (well, impossible) to debug, because the system totally hangs and doesn’t accept any keypresses.

Would a ZIL help in this case?
I always thought that NFS was the only thing that did SYNC writes…

From owner-freebsd-fs@freebsd.org Mon May 16 23:14:25 2016
Return-Path: 
Delivered-To: freebsd-fs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 86A07B3E393 for ; Mon, 16 May 2016 23:14:25 +0000 (UTC) (envelope-from rainer@ultra-secure.de)
Received: from connect.ultra-secure.de (connect.ultra-secure.de [88.198.71.201]) by mx1.freebsd.org (Postfix) with ESMTP id E65151647 for ; Mon, 16 May 2016 23:14:24 +0000 (UTC) (envelope-from rainer@ultra-secure.de)
Received: (Haraka outbound); Tue, 17 May 2016 01:14:23 +0200
Authentication-Results: connect.ultra-secure.de; iprev=pass; auth=pass (plain); spf=none smtp.mailfrom=ultra-secure.de
Received-SPF: None (connect.ultra-secure.de: domain of ultra-secure.de does not designate 217.71.83.52 as permitted sender) receiver=connect.ultra-secure.de; identity=mailfrom; client-ip=217.71.83.52; helo=[192.168.1.200]; envelope-from=
Received: from [192.168.1.200] (217-071-083-052.ip-tech.ch [217.71.83.52]) by connect.ultra-secure.de (Haraka/2.6.2-toaster) with ESMTPSA id C3F38CD4-6758-439C-B896-0AEB07043CA5.1 envelope-from (authenticated bits=0) (version=TLSv1/SSLv3 cipher=AES256-SHA verify=NO); Tue, 17 May 2016 01:14:21 +0200
Content-Type: text/plain; charset=utf-8
Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\))
Subject: Re: zfs receive stalls whole system
From: Rainer Duffner 
In-Reply-To: <0C2233A9-C64A-4773-ABA5-C0BCA0D037F0@ultra-secure.de>
Date: Tue, 17 May 2016 01:14:19 +0200
Content-Transfer-Encoding: quoted-printable
Message-Id: <1513A05F-1DA7-4765-A67C-360555C97CF0@ultra-secure.de>
References: <0C2233A9-C64A-4773-ABA5-C0BCA0D037F0@ultra-secure.de>
To: FreeBSD Filesystems 
X-Mailer: Apple Mail (2.3124)
X-Haraka-GeoIP: EU, CH, 451km
X-Haraka-ASN: 24951
X-Haraka-GeoIP-Received: 
X-Haraka-ASN: 24951 217.71.80.0/20
X-Haraka-ASN-CYMRU: asn=24951 net=217.71.80.0/20 country=CH assignor=ripencc date=2003-08-07
X-Haraka-FCrDNS: 217-071-083-052.ip-tech.ch
X-Haraka-p0f: os="Mac OS X " link_type="DSL" distance=13 total_conn=3 shared_ip=N
X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on spamassassin
X-Spam-Level: 
X-Spam-Status: No, score=-2.9 required=5.0 tests=ALL_TRUSTED,BAYES_00 autolearn=ham autolearn_force=no version=3.4.1
X-Haraka-Karma: score: 6, good: 169, bad: 0, connections: 329, history: 169, asn_score: 103, asn_connections: 114, asn_good: 103, asn_bad: 0, pass:all_good, asn, asn_all_good, relaying
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: Filesystems 
List-Unsubscribe: , 
List-Archive: 
List-Post: 
List-Help: 
List-Subscribe: , 
X-List-Received-Date: Mon, 16 May 2016 23:14:25 -0000

>
> Would a ZIL help in this case?
> I always thought that NFS was the only thing that did SYNC writes…
>

I mean an SSD-based SLOG device, for the record.

Because I’ve already maxed out the three PCIe slots I have with the single CPU, my only option would be a DC S3710, it seems.
Only problem is I have to mount it into an HP disk-tray somehow, which I’ve never tried.
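One way to narrow down whether sync writes matter here at all, before buying hardware (a diagnostic sketch only; "tank/backup" is a placeholder for the receiving dataset, and only standard zfs(8) commands are assumed):

# note the current value, then disable sync writes temporarily
zfs get sync tank/backup
zfs set sync=disabled tank/backup
# run a zxfer and watch per-disk latency with gstat in another terminal
# WARNING: sync=disabled can lose the last few seconds of writes on power loss
zfs inherit sync tank/backup

If the stalls look identical with sync disabled, an SLOG would not help anyway.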
Also, it would sit on the same SAS-backplane as the other disks, which may or may not be a good thing...

From owner-freebsd-fs@freebsd.org Tue May 17 01:36:13 2016
Return-Path: 
Delivered-To: freebsd-fs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id E94A6B3DBE4 for ; Tue, 17 May 2016 01:36:13 +0000 (UTC) (envelope-from m.e.sanliturk@gmail.com)
Received: from mail-oi0-x22e.google.com (mail-oi0-x22e.google.com [IPv6:2607:f8b0:4003:c06::22e]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id BAFF71067; Tue, 17 May 2016 01:36:13 +0000 (UTC) (envelope-from m.e.sanliturk@gmail.com)
Received: by mail-oi0-x22e.google.com with SMTP id v145so2863560oie.0; Mon, 16 May 2016 18:36:13 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to:cc; bh=1JFEatSlMO0Hp7F9c31GY/oDeD0LzVNEFL5AthZjwSU=; b=nEIK1S5SqHjIBDlPaaJI4lgJC9n6GcFNtMGco7vikDTf+XJ6oo/VYIy/Ev6pNY6xXP lGtKRFbg8/zhHWfCx5iieX9NmZUWkKO49gJMeze31tp/q5P5h5RmX4Bu32Tw9cwrCymI O5n+IorMpA3OrnQMs02tOiBMynuKDdvG7tQp9qGtRHBPEtAHng89nfhdLHurKZH42gOB IxWPL0d/HPoRP5jOWeoSVuGXuhDB3TajB4t63uzQWIYhRhaffez+B1R4p4ozlF3XZCqU dEQc5to/hlB9j6CVEIzhtGcJDZhnDrUhvKFoSdX7+a19MM3Inu0qxPwl1WQ9Mj4kx0Iy +oag==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:date:message-id:subject:from:to:cc; bh=1JFEatSlMO0Hp7F9c31GY/oDeD0LzVNEFL5AthZjwSU=; b=nHa3ywsmK0wD+HUckxYV/PiVEepQOx0fASmRusxFELsZDV4OBmaf6it0dudeDm0GLY /DoGlj0JwbvICBbyAVgGs32REJ0F0Irzh609nzI00r1UHtfnmNT+6tnL34k806FiRKec OiAUgdC7tzKu2y7fd+ytEwmkiJIZTRpWpuwH+1kOrEdXIeG+JXkvRVhnzZFhItSP3GSY l3anbjxhjwFW0MRxXTQA408+y+FZ65kJFf3E8uoAe49BWD4474kFYSBYKf0gZpGKpeFD 9jUf34Z+HKwdspYOJz10hvbGss4SNvwYxeMeL4oI6iR2yem17s+rvoUubp/tDaXG7mNc hjXQ==
X-Gm-Message-State: AOPr4FWQo+LAzjsjggiUjMkDniystn4TU/ev8k46y1cWAJNftsCN18cd/9fZAUpA9oWsOPMGxzx1mDqsSfGctQ==
MIME-Version: 1.0
X-Received: by 10.202.222.197 with SMTP id v188mr16403551oig.82.1463448972892; Mon, 16 May 2016 18:36:12 -0700 (PDT)
Received: by 10.157.45.131 with HTTP; Mon, 16 May 2016 18:36:12 -0700 (PDT)
In-Reply-To: <573A5116.3090302@quip.cz>
References: <5E69742D-D2E0-437F-B4A9-A71508C370F9@FreeBSD.org> <89D73122-FAC7-4449-AAB3-C4BBE74B960A@FreeBSD.org> <573A5116.3090302@quip.cz>
Date: Mon, 16 May 2016 18:36:12 -0700
Message-ID: 
Subject: Re: Best practice for high availability ZFS pool
From: Mehmet Erol Sanliturk 
To: Miroslav Lachman <000.fbsd@quip.cz>
Cc: Palle Girgensohn , Borja Marcos , freebsd-fs@freebsd.org
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
X-Content-Filtered-By: Mailman/MimeDel 2.1.22
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: Filesystems 
List-Unsubscribe: , 
List-Archive: 
List-Post: 
List-Help: 
List-Subscribe: , 
X-List-Received-Date: Tue, 17 May 2016 01:36:14 -0000

On Mon, May 16, 2016 at 4:00 PM, Miroslav Lachman <000.fbsd@quip.cz> wrote:

> Palle Girgensohn wrote on 05/17/2016 00:36:
>
>> 16 maj 2016 kl. 15:51 skrev Borja Marcos :
>>>
>>>
>>> On 16 May 2016, at 12:08, Palle Girgensohn wrote:
>>>>
>>>> Hi,
>>>>
>>>> We need to set up a ZFS pool with redundancy. The main goal is high
>>>> availability - uptime.
>>>>
>>>> I can see a few paths to follow.
>>>>
>>>> 1. HAST + ZFS
>>>>
>>>
>>> Which means that a possible corruption-causing bug in ZFS would vaporize
>>> the data of both replicas.
>>>
>>>> 3. ZFS replication (zfs snapshot + zfs send | ssh | zfs receive)
>>>>
>>>
>>> If you don’t have a hard requirement for synchronous replication (and,
>>> in that case, I would opt for a more application-aware approach)
>>> it’s the best method in my opinion.
>>>
>>
>> That was exactly my thought 18 months ago, and we set up two systems with
>> zfs snapshot + zfs send | ssh | zfs receive. It works, but the problem is
>> it is just too slow, and a complete sync takes like 10 minutes for all the file
>> systems. We are forced to sync the file systems one at a time to get the
>> kind of control and separation we need. Even if we could speed that up
>> somehow, we are really looking for a more resilient system. Also, constant
>> snapshotting and writing makes scrub very slow, so we need to tune down the
>> amount of syncing every fourth weekend to scrub. It's OK but not optimal,
>> so we're pondering something better.
>>
>> My first choice is really HAST at the moment, but I also don't find much
>> written about it in the last couple of years, apart from some articles about setting
>> it up in very minimal testbeds, or posts about performance and stability
>> troubles. This makes me wonder: is HAST actively maintained? Is it stable,
>> used and loved by the community? I'd love to hear some success stories with
>> fairly large installations of at least 20 TB or so.
>>
>
> I am not using HAST personally, but I read about success with HAST and ZFS
> somewhere in the FreeBSD mailing lists. I don't have a direct link / bookmark
> for it. Maybe you will find it through a search engine.
>
> Miroslav Lachman
> _______________________________________________
> f
>

If you search for HAST and ZFS in Google, it will provide a long list of possibly related pages.
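For the archives, the basic shape of a HAST-backed pool along the lines of the FreeBSD Handbook (a minimal sketch; the hostnames, addresses and disk names are invented for the example):

# /etc/hast.conf, identical on both nodes; nodeA/nodeB must match hostnames
resource disk0 {
        on nodeA {
                local /dev/da0
                remote 10.0.0.2
        }
        on nodeB {
                local /dev/da0
                remote 10.0.0.1
        }
}

# on both nodes:
hastctl create disk0
service hastd onestart
# on the node chosen as primary (the provider appears as /dev/hast/disk0):
hastctl role primary disk0
zpool create tank /dev/hast/disk0
# on the standby:
hastctl role secondary disk0

Failover is then a matter of demoting one node, promoting the other, and importing the pool there.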
Mehmet Erol Sanliturk From owner-freebsd-fs@freebsd.org Tue May 17 01:48:14 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 5F369B3DFC1 for ; Tue, 17 May 2016 01:48:14 +0000 (UTC) (envelope-from bfriesen@simple.dallas.tx.us) Received: from smtp.simplesystems.org (smtp.simplesystems.org [65.66.246.90]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 2F58B1744; Tue, 17 May 2016 01:48:13 +0000 (UTC) (envelope-from bfriesen@simple.dallas.tx.us) Received: from freddy.simplesystems.org (freddy.simplesystems.org [65.66.246.65]) by smtp.simplesystems.org (8.14.4+Sun/8.14.4) with ESMTP id u4H1hnHC008304; Mon, 16 May 2016 20:43:49 -0500 (CDT) Date: Mon, 16 May 2016 20:43:49 -0500 (CDT) From: Bob Friesenhahn X-X-Sender: bfriesen@freddy.simplesystems.org To: Palle Girgensohn cc: freebsd-fs@freebsd.org Subject: Re: Best practice for high availability ZFS pool In-Reply-To: <5E69742D-D2E0-437F-B4A9-A71508C370F9@FreeBSD.org> Message-ID: References: <5E69742D-D2E0-437F-B4A9-A71508C370F9@FreeBSD.org> User-Agent: Alpine 2.20 (GSO 67 2015-01-07) MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII; format=flowed X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (smtp.simplesystems.org [65.66.246.90]); Mon, 16 May 2016 20:43:49 -0500 (CDT) X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 May 2016 01:48:14 -0000 On Mon, 16 May 2016, Palle Girgensohn wrote: > > Shared storage still has a single point of failure, the JBOD box. > Apart from that, is there even any support for the kind of storage > PCI cards that support dual head for a storage box? I cannot find > any. Use two (or three) JBOD boxes and do simple zfs mirroring across them so you can unplug a JBOD and the pool still works. Or use a bunch of JBOD boxes and use zfs raidz2 (or raidz3) across them with careful LUN selection so there is total storage redundancy and you can unplug a JBOD and the pool still works. Fiber channel (or FCoE) or iSCSI allows putting the hardware at some distance. Without completely isolated systems there is always the risk of total failure. Even with zfs send there is the risk of total failure if the sent data results in corruption on the receiving side. Decide if you really want to optimize for maximum availability or you want to minimize the duration of the outage if something goes wrong. There is a difference. 
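To make the mirrored-across-JBODs idea concrete (a sketch; the daX names are invented, the point being that each mirror vdev takes one disk from each JBOD):

# JBOD 1 exposes da0-da2, JBOD 2 exposes da10-da12 (names assumed)
zpool create tank \
    mirror da0 da10 \
    mirror da1 da11 \
    mirror da2 da12
# either JBOD can now be unplugged and every vdev keeps one healthy side
zpool status tank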
Bob -- Bob Friesenhahn bfriesen@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ From owner-freebsd-fs@freebsd.org Tue May 17 01:51:28 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id C7908B3E164 for ; Tue, 17 May 2016 01:51:28 +0000 (UTC) (envelope-from bfriesen@simple.dallas.tx.us) Received: from smtp.simplesystems.org (smtp.simplesystems.org [65.66.246.90]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 9986719AF for ; Tue, 17 May 2016 01:51:28 +0000 (UTC) (envelope-from bfriesen@simple.dallas.tx.us) Received: from freddy.simplesystems.org (freddy.simplesystems.org [65.66.246.65]) by smtp.simplesystems.org (8.14.4+Sun/8.14.4) with ESMTP id u4H1pQeY008619; Mon, 16 May 2016 20:51:27 -0500 (CDT) Date: Mon, 16 May 2016 20:51:26 -0500 (CDT) From: Bob Friesenhahn X-X-Sender: bfriesen@freddy.simplesystems.org To: Rainer Duffner cc: FreeBSD Filesystems Subject: Re: zfs receive stalls whole system In-Reply-To: <0C2233A9-C64A-4773-ABA5-C0BCA0D037F0@ultra-secure.de> Message-ID: References: <0C2233A9-C64A-4773-ABA5-C0BCA0D037F0@ultra-secure.de> User-Agent: Alpine 2.20 (GSO 67 2015-01-07) MIME-Version: 1.0 X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (smtp.simplesystems.org [65.66.246.90]); Mon, 16 May 2016 20:51:27 -0500 (CDT) Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 8BIT X-Content-Filtered-By: Mailman/MimeDel 2.1.22 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 May 2016 01:51:28 -0000 On Tue, 17 May 2016, Rainer Duffner wrote: > > It’s very difficult (well, impossible) to debug, because the system > totally hangs and doesn’t accept any keypresses. > > Would a ZIL help in this case? > I always thought that NFS was the only thing that did SYNC writes
 This sounds like a hardware or driver problem. A dedicated ZIL won't help a system which entirely hangs. Bob -- Bob Friesenhahn bfriesen@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ From owner-freebsd-fs@freebsd.org Tue May 17 03:59:57 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 34572B3EDD1 for ; Tue, 17 May 2016 03:59:57 +0000 (UTC) (envelope-from rainer@ultra-secure.de) Received: from connect.ultra-secure.de (connect.ultra-secure.de [88.198.71.201]) by mx1.freebsd.org (Postfix) with ESMTP id 91AB4115F for ; Tue, 17 May 2016 03:59:56 +0000 (UTC) (envelope-from rainer@ultra-secure.de) Received: (Haraka outbound); Tue, 17 May 2016 05:59:54 +0200 Authentication-Results: connect.ultra-secure.de; iprev=pass; auth=pass (plain); spf=none smtp.mailfrom=ultra-secure.de Received-SPF: None (connect.ultra-secure.de: domain of ultra-secure.de does not designate 217.71.83.52 as permitted sender) receiver=connect.ultra-secure.de; identity=mailfrom; client-ip=217.71.83.52; helo=[192.168.1.200]; envelope-from= Received: from [192.168.1.200] (217-071-083-052.ip-tech.ch [217.71.83.52]) by connect.ultra-secure.de (Haraka/2.6.2-toaster) with ESMTPSA id DBC21CD3-3F3C-4C27-B0CE-EA8A86995E60.1 envelope-from (authenticated bits=0) (version=TLSv1/SSLv3 cipher=AES256-SHA verify=NO); Tue, 17 May 2016 05:59:47 +0200 Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\)) Subject: Re: zfs receive stalls whole system From: Rainer Duffner In-Reply-To: Date: Tue, 17 May 2016 05:59:45 +0200 Cc: FreeBSD Filesystems Content-Transfer-Encoding: quoted-printable Message-Id: <3E271E07-F60E-4181-B8B0-9ED2CFCDF5A0@ultra-secure.de> References: <0C2233A9-C64A-4773-ABA5-C0BCA0D037F0@ultra-secure.de> To: Bob Friesenhahn X-Mailer: Apple Mail (2.3124) X-Haraka-GeoIP: EU, CH, 451km X-Haraka-ASN: 24951 X-Haraka-GeoIP-Received: X-Haraka-ASN: 24951 217.71.80.0/20 X-Haraka-ASN-CYMRU: asn=24951 net=217.71.80.0/20 country=CH assignor=ripencc date=2003-08-07 X-Haraka-FCrDNS: 217-071-083-052.ip-tech.ch X-Haraka-p0f: os="Mac OS X " link_type="DSL" distance=13 total_conn=1 shared_ip=N X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on spamassassin X-Spam-Level: X-Spam-Status: No, score=-2.9 required=5.0 tests=ALL_TRUSTED,BAYES_00, URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.1 X-Haraka-Karma: score: 6, good: 170, bad: 0, connections: 330, history: 170, asn_score: 104, asn_connections: 115, asn_good: 104, asn_bad: 0, pass:all_good, asn, asn_all_good, relaying X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 May 2016 03:59:57 -0000 > Am 17.05.2016 um 03:51 schrieb Bob Friesenhahn = : >=20 > On Tue, 17 May 2016, Rainer Duffner wrote: >>=20 >> It=E2=80=99s very difficult (well, impossible) to debug, because the = system totally hangs and doesn=E2=80=99t accept any keypresses. >>=20 >> Would a ZIL help in this case? >> I always thought that NFS was the only thing that did SYNC writes=E2=80= =A6 >=20 > This sounds like a hardware or driver problem. A dedicated ZIL won't = help a system which entirely hangs. When I rebuilt these systems, I started with the 2nd one, the = standby-system. 
I zfs sent 5 or 6T worth of data from the original system to it and it was very fast. I got 600 MBit flat out of it.
Then, I made that system master while I rebuilt the other one.
When I synced back, I got maybe 500 MBit on the zfs sends.
And I started to see these stalls on sending updates.

Could this be a problem:

(nfs2-prod) 1 # sysctl -a | grep mps | grep "driver_version\|firmware_version"
dev.mps.2.driver_version: 20.00.00.00-fbsd
dev.mps.2.firmware_version: 15.10.01.00
dev.mps.1.driver_version: 20.00.00.00-fbsd
dev.mps.1.firmware_version: 13.10.53.00
dev.mps.0.driver_version: 20.00.00.00-fbsd
dev.mps.0.firmware_version: 13.10.53.00

As per this thread:
https://forums.freenas.org/index.php?threads/9-3-1-update-with-alert-firmware-version-16-does-not-match-driver-version-20-for-dev-mps0.36536/page-4

I will have to ask HP for a newer firmware then…

From owner-freebsd-fs@freebsd.org Tue May 17 07:47:17 2016
Return-Path: 
Delivered-To: freebsd-fs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id D4384B3E79D for ; Tue, 17 May 2016 07:47:17 +0000 (UTC) (envelope-from jg@internetx.com)
Received: from mx1.internetx.com (mx1.internetx.com [62.116.129.39]) by mx1.freebsd.org (Postfix) with ESMTP id 98EAB1DA6; Tue, 17 May 2016 07:47:17 +0000 (UTC) (envelope-from jg@internetx.com)
Received: from localhost (localhost [127.0.0.1]) by mx1.internetx.com (Postfix) with ESMTP id A015F45FC0CD; Tue, 17 May 2016 09:41:50 +0200 (CEST)
X-Virus-Scanned: InterNetX GmbH amavisd-new at ix-mailer.internetx.de
Received: from mx1.internetx.com ([62.116.129.39]) by localhost (ix-mailer.internetx.de [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id zDk391QzatOV; Tue, 17 May 2016 09:41:48 +0200 (CEST)
Received: from [192.168.100.26] (pizza.internetx.de [62.116.129.3]) (using TLSv1 with cipher AES128-SHA (128/128 bits)) (No client certificate requested) by mx1.internetx.com (Postfix) with ESMTPSA id 602374C4C754; Tue, 17 May 2016 09:41:48 +0200 (CEST)
Subject: Re: Best practice for high availability ZFS pool
References: <5E69742D-D2E0-437F-B4A9-A71508C370F9@FreeBSD.org>
To: Palle Girgensohn , freebsd-fs@freebsd.org
From: InterNetX - Juergen Gotteswinter 
Reply-To: jg@internetx.com
Message-ID: 
Date: Tue, 17 May 2016 09:41:44 +0200
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:45.0) Gecko/20100101 Thunderbird/45.1.0
MIME-Version: 1.0
In-Reply-To: <5E69742D-D2E0-437F-B4A9-A71508C370F9@FreeBSD.org>
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: 8bit
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: Filesystems 
List-Unsubscribe: , 
List-Archive: 
List-Post: 
List-Help: 
List-Subscribe: , 
X-List-Received-Date: Tue, 17 May 2016 07:47:17 -0000

Hi,

Am 5/16/2016 um 12:08 PM schrieb Palle Girgensohn:
> Hi,
>
> We need to set up a ZFS pool with redundancy. The main goal is high availability - uptime.
>
> I can see a few paths to follow.
>
> 1. HAST + ZFS

Don't do this; it has already been discussed some time ago, and AFAIK nothing has changed since then:

https://lists.freebsd.org/pipermail/freebsd-fs/2014-October/020084.html

>
> 2. Some sort of shared storage, two machines sharing a JBOD box.

Take care when choosing SAS HBAs and expanders; avoid SATA behind SAS. With dual-expander JBODs you will be able to build an HA setup, but I highly recommend avoiding any home-brew solutions. Go for RSF-1.

>
> 3.
ZFS replication (zfs snapshot + zfs send | ssh | zfs receive) > > 4. using something else than ZFS, even a different OS if required. > > My main concern with HAST+ZFS is performance. Google offer some insights here, I find mainly unsolved problems. Please share any success stories or other experiences. > performance isnt the real problem, check the older discussion mentioned above. > Shared storage still has a single point of failure, the JBOD box. Apart from that, is there even any support for the kind of storage PCI cards that support dual head for a storage box? I cannot find any. > the jbods are just a dumb piece of metal with an expander mounted. so far, i never had a broken one. > We are running with ZFS replication today, but it is just too slow for the amount of data. > replicate more often to keep the delta between each snapshot as small as possible? maybe even 10G crosslink if possible? > We prefer to keep ZFS as we already have a rather big (~30 TB) pool and also tools, scripts, backup all is using ZFS, but if there is no solution using ZFS, we're open to alternatives. Nexenta springs to mind, but I believe it is using shared storage for redundance, so it does have single points of failure? > > Any other suggestions? Please share your experience. :) > > Palle > From owner-freebsd-fs@freebsd.org Tue May 17 07:56:28 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id F0982B3EBEA for ; Tue, 17 May 2016 07:56:28 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from kenobi.freebsd.org (kenobi.freebsd.org [IPv6:2001:1900:2254:206a::16:76]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id E1A3E1953 for ; Tue, 17 May 2016 07:56:28 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from bugs.freebsd.org ([127.0.1.118]) by kenobi.freebsd.org (8.15.2/8.15.2) with ESMTP id u4H7uSBY018300 for ; Tue, 17 May 2016 07:56:28 GMT (envelope-from bugzilla-noreply@freebsd.org) From: bugzilla-noreply@freebsd.org To: freebsd-fs@FreeBSD.org Subject: [Bug 209093] ZFS snapshot rename : .zfs/snapshot messes up Date: Tue, 17 May 2016 07:56:28 +0000 X-Bugzilla-Reason: CC X-Bugzilla-Type: changed X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: Base System X-Bugzilla-Component: kern X-Bugzilla-Version: 10.3-RELEASE X-Bugzilla-Keywords: X-Bugzilla-Severity: Affects Some People X-Bugzilla-Who: commit-hook@freebsd.org X-Bugzilla-Status: Open X-Bugzilla-Resolution: X-Bugzilla-Priority: --- X-Bugzilla-Assigned-To: avg@FreeBSD.org X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: Message-ID: In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Bugzilla-URL: https://bugs.freebsd.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 May 2016 07:56:29 -0000 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=3D209093 --- Comment #3 from commit-hook@freebsd.org --- A commit references this bug: Author: avg Date: Tue May 17 07:56:05 UTC 2016 New revision: 300024 URL: https://svnweb.freebsd.org/changeset/base/300024 Log: zfs_ioc_rename: fix a reversed condition FreeBSD 
zfs_ioc_rename() has an option, not present upstream, that allows to rename snapshots without unmounting them first. I am not sure what is a rationale for that option, but its actual behavior was the opposite of the intended behavior. That is, by default the snapshots were not unmounted. The option was introduced as part of a large update from upstream in r248498. One of the consequences was a havoc under .zfs/snapshot after the rename. The snapshots got new names but were mounted on top of directories with old names, so readdir would list the new names, but lookup would still find the old mounts. PR: 209093 Reported by: Fr?d?ric VANNI?RE MFC after: 5 days Changes: head/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c --=20 You are receiving this mail because: You are on the CC list for the bug.= From owner-freebsd-fs@freebsd.org Tue May 17 07:58:25 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 0C77FB3EDAC for ; Tue, 17 May 2016 07:58:25 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from kenobi.freebsd.org (kenobi.freebsd.org [IPv6:2001:1900:2254:206a::16:76]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id F1A771B98 for ; Tue, 17 May 2016 07:58:24 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from bugs.freebsd.org ([127.0.1.118]) by kenobi.freebsd.org (8.15.2/8.15.2) with ESMTP id u4H7wObv021387 for ; Tue, 17 May 2016 07:58:24 GMT (envelope-from bugzilla-noreply@freebsd.org) From: bugzilla-noreply@freebsd.org To: freebsd-fs@FreeBSD.org Subject: [Bug 209093] ZFS snapshot rename : .zfs/snapshot messes up Date: Tue, 17 May 2016 07:58:25 +0000 X-Bugzilla-Reason: CC X-Bugzilla-Type: changed X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: Base System X-Bugzilla-Component: kern X-Bugzilla-Version: 10.3-RELEASE X-Bugzilla-Keywords: X-Bugzilla-Severity: Affects Some People X-Bugzilla-Who: avg@FreeBSD.org X-Bugzilla-Status: In Progress X-Bugzilla-Resolution: X-Bugzilla-Priority: --- X-Bugzilla-Assigned-To: avg@FreeBSD.org X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: bug_status Message-ID: In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Bugzilla-URL: https://bugs.freebsd.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 May 2016 07:58:25 -0000 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=3D209093 Andriy Gapon changed: What |Removed |Added ---------------------------------------------------------------------------- Status|Open |In Progress --=20 You are receiving this mail because: You are on the CC list for the bug.= From owner-freebsd-fs@freebsd.org Tue May 17 07:59:44 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id AA3D4B3EE1F for ; Tue, 17 May 2016 07:59:44 +0000 (UTC) (envelope-from borjam@sarenet.es) Received: from mailman.ysv.freebsd.org (unknown [127.0.1.3]) by mx1.freebsd.org (Postfix) with ESMTP id 9A5841C97 for ; Tue, 17 May 2016 07:59:44 +0000 (UTC) 
(envelope-from borjam@sarenet.es)
Received: by mailman.ysv.freebsd.org (Postfix) id 99C27B3EE1D; Tue, 17 May 2016 07:59:44 +0000 (UTC)
Delivered-To: fs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 994FDB3EE1B; Tue, 17 May 2016 07:59:44 +0000 (UTC) (envelope-from borjam@sarenet.es)
Received: from cu01176b.smtpx.saremail.com (cu01176b.smtpx.saremail.com [195.16.151.151]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 6038B1C96; Tue, 17 May 2016 07:59:43 +0000 (UTC) (envelope-from borjam@sarenet.es)
Received: from [172.16.8.36] (izaro.sarenet.es [192.148.167.11]) by proxypop01.sare.net (Postfix) with ESMTPSA id E7CBC9DD374; Tue, 17 May 2016 09:49:46 +0200 (CEST)
Content-Type: text/plain; charset=utf-8
Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\))
Subject: Re: ZFS and NVMe, trim caused stalling
From: Borja Marcos 
In-Reply-To: <5E710EA5-C9B0-4521-85F1-3FE87555B0AF@bsdimp.com>
Date: Tue, 17 May 2016 09:49:46 +0200
Cc: fs@freebsd.org, FreeBSD-STABLE Mailing List 
Content-Transfer-Encoding: quoted-printable
Message-Id: 
References: <5E710EA5-C9B0-4521-85F1-3FE87555B0AF@bsdimp.com> 
To: Warner Losh 
X-Mailer: Apple Mail (2.3124)
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: Filesystems 
List-Unsubscribe: , 
List-Archive: 
List-Post: 
List-Help: 
List-Subscribe: , 
X-List-Received-Date: Tue, 17 May 2016 07:59:44 -0000

> On 05 May 2016, at 16:39, Warner Losh wrote:
>
>> What do you think? In some cases it’s clear that TRIM can do more harm than good.
>
> I think it’s best we not overreact.

I agree. But with this issue the system is almost unusable for now.

> This particular case is caused by the nvd driver, not the Intel P3500 NVMe drive. You need
> a solution (3): Fix the driver.
>
> Specifically, ZFS is pushing down a boatload of BIO_DELETE requests. In ata/da land, these
> requests are queued up, then collapsed together as much as makes sense (or is possible).
> This vastly helps performance (even with the extra sorting that I forced to be in there that I
> need to fix before 11). The nvd driver needs to do the same thing.

I understand that, but I don’t think it’s good that ZFS depends blindly on a driver feature such as that. Of course, it’s great to exploit it.

I have also noticed that ZFS has a good throttling mechanism for write operations. A similar mechanism should throttle trim requests so that they don’t clog the whole system.

> I’d be extremely hesitant to toss away TRIMs. They are actually quite important for
> the FTL in the drive’s firmware to properly manage the NAND wear. More free space always
> reduces write amplification. It tends to go as 1 / freespace, so simply dropping them on
> the floor should be done with great reluctance.

I understand. I was wondering about choosing the lesser of two evils: a 15-minute I/O stall (I deleted 2 TB of data, which is a lot, but not so unrealistic) or setting trims aside during peak activity. I see that I was wrong on that, as a throttling mechanism would probably be more than enough, unless the system is close to running out of space.

I’ve filed a bug report anyway. And copying to -stable.

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209571

Thanks!

Borja.
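For reference, the TRIM backlog can at least be watched from userland while this happens. A sketch; the sysctl names below are from the 10.x ZFS TRIM code and are worth double-checking on a given revision:

# cumulative TRIM zio counters: bytes, success, failed, unsupported
sysctl kstat.zfs.misc.zio_trim
# global switch; 0 disables TRIM entirely (a last resort, per the above)
sysctl vfs.zfs.trim.enabled
# how many txgs freed blocks are held back before being trimmed
sysctl vfs.zfs.trim.txg_delay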
From owner-freebsd-fs@freebsd.org Tue May 17 08:20:57 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id B0D63B3E992 for ; Tue, 17 May 2016 08:20:57 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from mailman.ysv.freebsd.org (unknown [127.0.1.3]) by mx1.freebsd.org (Postfix) with ESMTP id 9BF3E1D7C for ; Tue, 17 May 2016 08:20:57 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: by mailman.ysv.freebsd.org (Postfix) id 9B34EB3E991; Tue, 17 May 2016 08:20:57 +0000 (UTC) Delivered-To: fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 9AD48B3E990 for ; Tue, 17 May 2016 08:20:57 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 33C901D7B for ; Tue, 17 May 2016 08:20:56 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from tom.home (kib@localhost [127.0.0.1]) by kib.kiev.ua (8.15.2/8.15.2) with ESMTPS id u4H8Kohw013266 (version=TLSv1 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO); Tue, 17 May 2016 11:20:51 +0300 (EEST) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.10.3 kib.kiev.ua u4H8Kohw013266 Received: (from kostik@localhost) by tom.home (8.15.2/8.15.2/Submit) id u4H8Ko7G013262; Tue, 17 May 2016 11:20:50 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Tue, 17 May 2016 11:20:50 +0300 From: Konstantin Belousov To: Bruce Evans Cc: fs@freebsd.org Subject: Re: quick fix for slow directory shrinking in ffs Message-ID: <20160517082050.GX89104@kib.kiev.ua> References: <20160517072705.F2157@besplex.bde.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160517072705.F2157@besplex.bde.org> User-Agent: Mutt/1.6.1 (2016-04-27) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no autolearn_force=no version=3.4.1 X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on tom.home X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 May 2016 08:20:57 -0000 On Tue, May 17, 2016 at 07:54:27AM +1000, Bruce Evans wrote: > ffs does very slow shrinking of directories after removing some files > leaves unused blocks at the end, by always doing synchronous truncation. > > This often happens in my normal usage: medium size builds expand /tmp > from 512 to 1024 to hold a few more hundred bytes of file names; > expansion is async and fast, but shrinking is sync and slow, and > with a certain size of build the boundary is crossed back and forth > very often. > > My /tmp directory is always on an async-mounted file system, so this > quick fix of always doing an async truncation for async mounts works > for me. Using IO_SYNC when not asked to is a bug for async mounts > in all cases anyway. > > The file system has block size 8192 and frag size 1024, so it is also > wrong to shrink to size DIRBLKSIZE = 512. 
The shrinkage seems to be > considered at every DIRBLKSIZE boundary, so not only small directories > are affected. > > The patch fixes an unrelated typo in a message. > > X Index: ufs_lookup.c > X =================================================================== > X --- ufs_lookup.c (revision 299263) > X +++ ufs_lookup.c (working copy) > X @@ -1131,9 +1131,9 @@ > X if (tvp != NULL) > X VOP_UNLOCK(tvp, 0); > X error = UFS_TRUNCATE(dvp, (off_t)dp->i_endoff, > X - IO_NORMAL | IO_SYNC, cr); > X + IO_NORMAL | (DOINGASYNC(dvp) ? 0 : IO_SYNC), cr); > X if (error != 0) > X - vprint("ufs_direnter: failted to truncate", dvp); > X + vprint("ufs_direnter: failed to truncate", dvp); > X #ifdef UFS_DIRHASH > X if (error == 0 && dp->i_dirhash != NULL) > X ufsdirhash_dirtrunc(dp, dp->i_endoff); > The IO_SYNC flag, for non-journaled SU and any kind of non-SU mounts, only affects the new blocks allocation mode, and write-out mode for the last fragment. The truncation itself (for -J) is performed in the context of the truncating thread. The cg blocks, after the bits are set to free, are marked for delayed write (with the background write hack). The inode block is written according to the mount mode, ignoring IO_SYNC. That is, for always fully populated directory files, I do not see how anything is changed by the patch. I committed the typo fix. From owner-freebsd-fs@freebsd.org Tue May 17 08:33:28 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 67E1EB3EF9F for ; Tue, 17 May 2016 08:33:28 +0000 (UTC) (envelope-from freebsd-listen@fabiankeil.de) Received: from smtprelay05.ispgateway.de (smtprelay05.ispgateway.de [80.67.31.98]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 2FCAF1F8A for ; Tue, 17 May 2016 08:33:27 +0000 (UTC) (envelope-from freebsd-listen@fabiankeil.de) Received: from [78.35.176.77] (helo=fabiankeil.de) by smtprelay05.ispgateway.de with esmtpsa (TLSv1.2:AES128-GCM-SHA256:128) (Exim 4.84) (envelope-from ) id 1b2aPy-0002dW-TL for freebsd-fs@freebsd.org; Tue, 17 May 2016 10:31:54 +0200 Date: Tue, 17 May 2016 10:27:57 +0200 From: Fabian Keil To: FreeBSD Filesystems Subject: Re: zfs receive stalls whole system Message-ID: <20160517102757.135c1468@fabiankeil.de> In-Reply-To: <0C2233A9-C64A-4773-ABA5-C0BCA0D037F0@ultra-secure.de> References: <0C2233A9-C64A-4773-ABA5-C0BCA0D037F0@ultra-secure.de> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha1; boundary="Sig_/ZojT=4SLUeXeJZEf2IdOajl"; protocol="application/pgp-signature" X-Df-Sender: Nzc1MDY3 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 May 2016 08:33:28 -0000 --Sig_/ZojT=4SLUeXeJZEf2IdOajl Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable Rainer Duffner wrote: > I have two servers, that were running FreeBSD 10.1-AMD64 for a long time,= one zfs-sending to the other (via zxfer). Both are NFS-servers and MySQL-s= laves, the sender is actively used as NFS-server, the recipient is just a w= arm-standby, in case something serious happens and we don=E2=80=99t want to= wait for a day until the restore is back in place. 
The MySQL-Slaves are ac= tively used as read-only servers (at the application level, Python=E2=80=99= s SQL-Alchemy does that, apparently). >=20 > They are HP DL380G8 (one CPU, hexacore) with over 128 GB RAM (I think one= has 144, the other has 192). > While they were running 10.1, they used HP P420 RAID-controllers with ind= ividual 12 RAID0 volumes that I pooled into 6-disk RAIDZ2 vdevs. > I use zfsnap to do hourly, daily and weekly snapshots. [...] > Now, when I do a zxfer, sometimes the whole system stalls while the data = is sent over, especially if the delta is large or if something else is read= ing from the disk at the same time (backup agent). >=20 > I had this before, on 10.0 (I believe, we didn=E2=80=99t have this in 9.1= either, IIRC) and it went away in 10.1. Do you use geli for swap device(s)? > It=E2=80=99s very difficult (well, impossible) to debug, because the syst= em totally hangs and doesn=E2=80=99t accept any keypresses. You could try reducing ZFS's deadman timeout to get a panic. On systems with local disks I usually use: vfs.zfs.deadman_enabled: 1 vfs.zfs.deadman_checktime_ms: 5000 vfs.zfs.deadman_synctime_ms: 10000 Fabian --Sig_/ZojT=4SLUeXeJZEf2IdOajl Content-Type: application/pgp-signature Content-Description: OpenPGP digital signature -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iEYEARECAAYFAlc61g4ACgkQBYqIVf93VJ0shgCaA2wnHQq+AKX3XK7yt5jWKHZ/ rUEAn1IMBjKGvRcA9ZljB/Qy7cY0gLAk =TR3y -----END PGP SIGNATURE----- --Sig_/ZojT=4SLUeXeJZEf2IdOajl-- From owner-freebsd-fs@freebsd.org Tue May 17 08:42:47 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 782ABB3D422 for ; Tue, 17 May 2016 08:42:47 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from mailman.ysv.freebsd.org (mailman.ysv.freebsd.org [IPv6:2001:1900:2254:206a::50:5]) by mx1.freebsd.org (Postfix) with ESMTP id 62F8E18BF for ; Tue, 17 May 2016 08:42:47 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: by mailman.ysv.freebsd.org (Postfix) id 61FF9B3D41F; Tue, 17 May 2016 08:42:47 +0000 (UTC) Delivered-To: fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 61A2DB3D41D for ; Tue, 17 May 2016 08:42:47 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id D57BF18BE for ; Tue, 17 May 2016 08:42:46 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from tom.home (kib@localhost [127.0.0.1]) by kib.kiev.ua (8.15.2/8.15.2) with ESMTPS id u4H8gfBE018503 (version=TLSv1 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO); Tue, 17 May 2016 11:42:42 +0300 (EEST) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.10.3 kib.kiev.ua u4H8gfBE018503 Received: (from kostik@localhost) by tom.home (8.15.2/8.15.2/Submit) id u4H8gfYj018502; Tue, 17 May 2016 11:42:41 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Tue, 17 May 2016 11:42:41 +0300 From: Konstantin Belousov To: Bruce Evans Cc: fs@freebsd.org Subject: Re: fix for per-mount i/o counting in ffs Message-ID: <20160517084241.GY89104@kib.kiev.ua> References: 
<20160517072104.I2137@besplex.bde.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160517072104.I2137@besplex.bde.org> User-Agent: Mutt/1.6.1 (2016-04-27) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no autolearn_force=no version=3.4.1 X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on tom.home X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 May 2016 08:42:47 -0000 On Tue, May 17, 2016 at 07:26:08AM +1000, Bruce Evans wrote: > Counting of i/o's in g_vfs_strategy() requires the fs to initialize > devvp->v_rdev->si_mountpt to non-null. This seems to be done correctly > in ext2fs and msdosfs, but in ffs it is not done for ro mounts, or for > rw mounts that started as ro. The bug is most obvious for the root > file system since it always starts as ro. I committed the comments updates. For the accounting patch, don't we want to account for all io, including the mount-time metadata reads and initial superblock update ? diff --git a/sys/ufs/ffs/ffs_vfsops.c b/sys/ufs/ffs/ffs_vfsops.c index 9776554..712fc21 100644 --- a/sys/ufs/ffs/ffs_vfsops.c +++ b/sys/ufs/ffs/ffs_vfsops.c @@ -780,6 +780,8 @@ ffs_mountfs(devvp, mp, td) mp->mnt_iosize_max = MAXPHYS; devvp->v_bufobj.bo_ops = &ffs_ops; + if (devvp->v_type == VCHR) + devvp->v_rdev->si_mountpt = mp; fs = NULL; sblockloc = 0; @@ -1049,8 +1051,6 @@ ffs_mountfs(devvp, mp, td) ffs_flushfiles(mp, FORCECLOSE, td); goto out; } - if (devvp->v_type == VCHR && devvp->v_rdev != NULL) - devvp->v_rdev->si_mountpt = mp; if (fs->fs_snapinum[0] != 0) ffs_snapshot_mount(mp); fs->fs_fmod = 1; @@ -1083,6 +1083,8 @@ ffs_mountfs(devvp, mp, td) out: if (bp) brelse(bp); + if (devvp->v_type == VCHR && devvp->v_rdev != NULL) + devvp->v_rdev->si_mountpt = NULL; if (cp != NULL) { DROP_GIANT(); g_topology_lock(); From owner-freebsd-fs@freebsd.org Tue May 17 08:42:48 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 1D322B3D426 for ; Tue, 17 May 2016 08:42:48 +0000 (UTC) (envelope-from wjw@digiware.nl) Received: from smtp.digiware.nl (unknown [IPv6:2001:4cb8:90:ffff::3]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id DB15418C0 for ; Tue, 17 May 2016 08:42:47 +0000 (UTC) (envelope-from wjw@digiware.nl) Received: from rack1.digiware.nl (unknown [127.0.0.1]) by smtp.digiware.nl (Postfix) with ESMTP id 24D29153402; Tue, 17 May 2016 10:42:45 +0200 (CEST) X-Virus-Scanned: amavisd-new at digiware.nl Received: from smtp.digiware.nl ([127.0.0.1]) by rack1.digiware.nl (rack1.digiware.nl [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id LKXrGJgCqvNe; Tue, 17 May 2016 10:42:44 +0200 (CEST) Received: from [IPv6:2001:4cb8:3:1:8c49:9de7:acf1:6a1f] (unknown [IPv6:2001:4cb8:3:1:8c49:9de7:acf1:6a1f]) by smtp.digiware.nl (Postfix) with ESMTP id 301D415340A; Tue, 17 May 2016 10:42:44 +0200 (CEST) Subject: Re: Bigger MAX_PATH (Was: Re: State of native encryption in ZFS) To: Peter Jeremy References: <5736E7B4.1000409@gmail.com> <57378707.19425.B54772B@s_sourceforge.nedprod.com> <57385356.4525.E728971@s_sourceforge.nedprod.com> 
<9ead4b28-9711-5e38-483f-ef9eaf0bc583@digiware.nl> <20160516200543.GC42426@server.rulingia.com>
Cc: "freebsd-fs@FreeBSD.org" 
From: Willem Jan Withagen 
Organization: Digiware Management b.v.
Message-ID: 
Date: Tue, 17 May 2016 10:42:32 +0200
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:45.0) Gecko/20100101 Thunderbird/45.1.0
MIME-Version: 1.0
In-Reply-To: <20160516200543.GC42426@server.rulingia.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: Filesystems 
List-Unsubscribe: , 
List-Archive: 
List-Post: 
List-Help: 
List-Subscribe: , 
X-List-Received-Date: Tue, 17 May 2016 08:42:48 -0000

On 16-5-2016 22:05, Peter Jeremy wrote:
> On 2016-May-16 15:18:17 +0200, Willem Jan Withagen wrote:
>> Trying to port Ceph is also running into the limit in:
>> /usr/include/sys/syslimits.h:
>> #define NAME_MAX 255 /* max bytes in a file name */
>>
>> but I also found:
>> /usr/include/stdio.h:
>> #define FILENAME_MAX 1024 /* must be <= PATH_MAX */
>>
>> So take a pick??
>
> There are two distinct limits: The maximum number of characters in a
> pathname component (ie the name seen in a directory entry): For UFS,
> this is 255 because the length is stored on disk in a uint8_t (I don't
> know the limit for ZFS). The other limit is the maximum number of
> characters in a pathname - PATH_MAX. This is used to dimension various
> buffers but isn't persistent on disk so you should be able to increase
> it by changing the relevant #defines and rebuilding everything.

Don't remember if I did such an experiment. Got to talk to the local
engineer on duty here to see if I can get a few more VMs to go compile
and blow up. :)

Getting the NAME_MAX size per fs is something I'm going to need in the
long run for Ceph to make optimal use of its capabilities.
I think that Linux is now at 1024, and the underlying store for Ceph is
going to 4096.....
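Querying the effective limits per mount point already works through pathconf(2); a quick sketch using getconf(1), where the mount point is just an example:

getconf NAME_MAX /tmp    # longest single name component the fs accepts (255 on UFS)
getconf PATH_MAX /tmp    # longest full pathname (1024 on FreeBSD today)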
--WjW From owner-freebsd-fs@freebsd.org Tue May 17 09:08:21 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 41F3FB3DBBE; Tue, 17 May 2016 09:08:21 +0000 (UTC) (envelope-from rainer@ultra-secure.de) Received: from connect.ultra-secure.de (connect.ultra-secure.de [88.198.71.201]) by mx1.freebsd.org (Postfix) with ESMTP id 676BE13C1; Tue, 17 May 2016 09:08:19 +0000 (UTC) (envelope-from rainer@ultra-secure.de) Received: (Haraka outbound); Tue, 17 May 2016 11:08:18 +0200 Authentication-Results: connect.ultra-secure.de; auth=pass (login); spf=none smtp.mailfrom=ultra-secure.de Received-SPF: None (connect.ultra-secure.de: domain of ultra-secure.de does not designate 127.0.0.16 as permitted sender) receiver=connect.ultra-secure.de; identity=mailfrom; client-ip=127.0.0.16; helo=connect.ultra-secure.de; envelope-from= Received: from connect.ultra-secure.de (expwebmail [127.0.0.16]) by connect.ultra-secure.de (Haraka/2.6.2-toaster) with ESMTPSA id 6E9A37E4-94A9-49FA-B13F-28674A2778A6.1 envelope-from (authenticated bits=0) (version=TLSv1/SSLv3 cipher=AES128-GCM-SHA256 verify=NO); Tue, 17 May 2016 11:08:15 +0200 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit Date: Tue, 17 May 2016 11:08:14 +0200 From: rainer@ultra-secure.de To: Fabian Keil Cc: FreeBSD Filesystems , owner-freebsd-fs@freebsd.org Subject: Re: zfs receive stalls whole system In-Reply-To: <20160517102757.135c1468@fabiankeil.de> References: <0C2233A9-C64A-4773-ABA5-C0BCA0D037F0@ultra-secure.de> <20160517102757.135c1468@fabiankeil.de> Message-ID: X-Sender: rainer@ultra-secure.de User-Agent: Roundcube Webmail/1.1.4 X-Haraka-GeoIP: --, , NaNkm X-Haraka-GeoIP-Received: X-Haraka-p0f: os="undefined undefined" link_type="undefined" distance=undefined total_conn=undefined shared_ip=Y X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on spamassassin X-Spam-Level: X-Spam-Status: No, score=-2.9 required=5.0 tests=ALL_TRUSTED,BAYES_00, URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.1 X-Haraka-Karma: score: 6, good: 42, bad: 0, connections: 57, history: 42, pass:all_good, relaying X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 May 2016 09:08:21 -0000 Am 2016-05-17 10:27, schrieb Fabian Keil: > Rainer Duffner wrote: > >> I have two servers, that were running FreeBSD 10.1-AMD64 for a long >> time, one zfs-sending to the other (via zxfer). Both are NFS-servers >> and MySQL-slaves, the sender is actively used as NFS-server, the >> recipient is just a warm-standby, in case something serious happens >> and we don’t want to wait for a day until the restore is back in >> place. The MySQL-Slaves are actively used as read-only servers (at the >> application level, Python’s SQL-Alchemy does that, apparently). >> >> They are HP DL380G8 (one CPU, hexacore) with over 128 GB RAM (I think >> one has 144, the other has 192). >> While they were running 10.1, they used HP P420 RAID-controllers with >> individual 12 RAID0 volumes that I pooled into 6-disk RAIDZ2 vdevs. >> I use zfsnap to do hourly, daily and weekly snapshots. > [...] 
>> Now, when I do a zxfer, sometimes the whole system stalls while the
>> data is sent over, especially if the delta is large or if something
>> else is reading from the disk at the same time (backup agent).
>>
>> I had this before, on 10.0 (I believe, we didn't have this in 9.1
>> either, IIRC) and it went away in 10.1.
>
> Do you use geli for swap device(s)?

Yes, I do.
/dev/mirror/swap.eli    none    swap    sw    0    0

Bad idea?

>> It's very difficult (well, impossible) to debug, because the system
>> totally hangs and doesn't accept any keypresses.
>
> You could try reducing ZFS's deadman timeout to get a panic.
> On systems with local disks I usually use:
>
> vfs.zfs.deadman_enabled: 1
> vfs.zfs.deadman_checktime_ms: 5000
> vfs.zfs.deadman_synctime_ms: 10000

Too bad I don't have a spare system I could use to test this ;-)

From owner-freebsd-fs@freebsd.org Tue May 17 09:56:41 2016
Return-Path:
Delivered-To: freebsd-fs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 10D24B3E7FB for ; Tue, 17 May 2016 09:56:41 +0000 (UTC) (envelope-from crest@rlwinm.de)
Received: from smtp.rlwinm.de (smtp.rlwinm.de [IPv6:2a01:4f8:201:31ef::e]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id D01361EE6 for ; Tue, 17 May 2016 09:56:40 +0000 (UTC) (envelope-from crest@rlwinm.de)
Received: from crest.local (unknown [87.253.189.132]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.rlwinm.de (Postfix) with ESMTPSA id 35B6A86E0 for ; Tue, 17 May 2016 11:56:29 +0200 (CEST)
Subject: Re: Best practice for high availability ZFS pool
To: freebsd-fs@freebsd.org
References: <5E69742D-D2E0-437F-B4A9-A71508C370F9@FreeBSD.org>
From: Jan Bramkamp
Message-ID: <84e3b485-d8bd-0f2f-47a4-85a64678d286@rlwinm.de>
Date: Tue, 17 May 2016 11:56:28 +0200
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:45.0) Gecko/20100101 Thunderbird/45.0
MIME-Version: 1.0
In-Reply-To: <5E69742D-D2E0-437F-B4A9-A71508C370F9@FreeBSD.org>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: Filesystems
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Tue, 17 May 2016 09:56:41 -0000

On 16/05/16 12:08, Palle Girgensohn wrote:
> Hi,
>
> We need to set up a ZFS pool with redundancy. The main goal is high
> availability - uptime.
>
> I can see a few paths to follow.
>
> 1. HAST + ZFS
>
> 2. Some sort of shared storage, two machines sharing a JBOD box.

If you're willing to put your disks into JBODs you can use JBODs with
two upstream ports per SAS expander and hook up one port to each head
node. Now you can access all the disks on both head nodes. The next
step you require is reliable master election. Two nodes alone can't
form the required consensus. In theory you could use SCSI persistent
reservations, but afaik FreeBSD lacks the tooling unless you want to
send raw SCSI commands through camcontrol. The easier solution is to
run a master election using one or three additional nodes for a total
of three or five. Both consul and etcd are available as ports and are
designed for reliable master election without special hardware. If you
go down this path you still need some kind of fencing (maybe via IPMI
or PDUs).
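To make the election part concrete: consul ships a small lock wrapper
that runs a command only while the node holds the lock. A rough,
untested sketch - the lock prefix, pool and service names below are
made up, and fencing the loser (IPMI power-off) still has to happen
before the import:

  # the child is started once the lock is acquired and is killed
  # again if the lock/session is lost
  consul lock ha/zfs-head sh -c '
      zpool import -f tank &&
      service mountd onestart && service nfsd onestart &&
      while :; do sleep 60; done'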
Now the JBOD is your SPoF, so get yourself at least two or, better,
three JBODs. For optimal performance and reliability use three JBODs
with 3-way mirrors spread over all JBODs. In this setup no hardware
protects your disks from the hot standby. If it falls out of sync you
have to keep it from writing to the shared direct-attached storage. One
way to achieve this would be to load the SAS HBA kernel module only
after the role (primary, backup) has been elected, and disable the HBA
option ROM in the UEFI/BIOS. I tried this once out of curiosity and it
performed well, but good luck finding any support for such a setup.

The same kind of setup should be possible with iSCSI instead of SAS
disks connected to dual-ported expanders, but I can't say anything
about the performance you can expect from the FreeBSD iSCSI target and
initiator. At least it would simplify fencing a lot, because the
fencing could be moved from the SCSI initiators into the SCSI targets.

Jan Bramkamp

From owner-freebsd-fs@freebsd.org Tue May 17 09:58:56 2016
Return-Path:
Delivered-To: freebsd-fs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 590C9B3E8B6 for ; Tue, 17 May 2016 09:58:56 +0000 (UTC) (envelope-from ben.rubson@gmail.com)
Received: from mail-wm0-x233.google.com (mail-wm0-x233.google.com [IPv6:2a00:1450:400c:c09::233]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 0E0BF1FCF for ; Tue, 17 May 2016 09:58:56 +0000 (UTC) (envelope-from ben.rubson@gmail.com)
Received: by mail-wm0-x233.google.com with SMTP id e201so132560545wme.0 for ; Tue, 17 May 2016 02:58:55 -0700 (PDT)
X-Received: by 10.28.86.10 with SMTP id k10mr458623wmb.96.1463479134591; Tue, 17 May 2016 02:58:54 -0700 (PDT)
Received: from [192.168.1.16] (210.236.26.109.rev.sfr.net.
[109.26.236.210]) by smtp.gmail.com with ESMTPSA id jp2sm2183352wjc.16.2016.05.17.02.58.53 (version=TLS1 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Tue, 17 May 2016 02:58:53 -0700 (PDT)
Content-Type: text/plain; charset=us-ascii
Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\))
Subject: Re: Bigger MAX_PATH (Was: Re: State of native encryption in ZFS)
From: Ben RUBSON
In-Reply-To: <9ead4b28-9711-5e38-483f-ef9eaf0bc583@digiware.nl>
Date: Tue, 17 May 2016 11:58:52 +0200
Content-Transfer-Encoding: quoted-printable
Message-Id: <9F057D48-5413-437B-A612-64D47E95C846@gmail.com>
References: <5736E7B4.1000409@gmail.com> <57378707.19425.B54772B@s_sourceforge.nedprod.com> <57385356.4525.E728971@s_sourceforge.nedprod.com> <9ead4b28-9711-5e38-483f-ef9eaf0bc583@digiware.nl>
To: "freebsd-fs@FreeBSD.org"
X-Mailer: Apple Mail (2.3124)
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: Filesystems
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Tue, 17 May 2016 09:58:56 -0000

> On 15 may 2016 at 12:45, Niall Douglas wrote:
>
>>> If FreeBSD had a bigger PATH_MAX then stackable encryptions layers
>>> like ecryptfs (encfs?) would be viable choices. Because encrypted
>>> path components are so long, one runs very rapidly into the maximum
>>> path on the system when PATH_MAX is so low.

Could you give us some examples where PATH_MAX was too low for you
using ecryptfs ?
I (for the moment) do not run into troubles using EncFS.

> http://freebsd.1045724.n5.nabble.com/misc-184340-PATH-MAX-not-interoperable-with-Linux-td5864469.html

And examples where PATH_MAX is too low using Rsync ?
Is it too low when we want to sync from Linux to FreeBSD ? Or from
FreeBSD to Linux ?
Using Rsync over SSH ? Or using the Rsync daemon on the receiving side ?

Thank you very much !
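One self-contained way to see where the limit bites, independent of
rsync (an untested sketch; the 200-byte component length is arbitrary):

  c=$(jot -s '' -b x 200)           # one 200-character name component
  p=/tmp/pmax; mkdir -p "$p"
  while mkdir "$p/$c" 2>/dev/null; do p="$p/$c"; done
  echo "mkdir gave up at path length ${#p}"   # ~1024 here, ~4096 on Linux
  rm -rf /tmp/pmax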
Ben From owner-freebsd-fs@freebsd.org Tue May 17 10:26:43 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 730D8B3EF91 for ; Tue, 17 May 2016 10:26:43 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mailman.ysv.freebsd.org (unknown [127.0.1.3]) by mx1.freebsd.org (Postfix) with ESMTP id 60D271E00 for ; Tue, 17 May 2016 10:26:43 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: by mailman.ysv.freebsd.org (Postfix) id 601CDB3EF8F; Tue, 17 May 2016 10:26:43 +0000 (UTC) Delivered-To: fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 5FC29B3EF8D for ; Tue, 17 May 2016 10:26:43 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail108.syd.optusnet.com.au (mail108.syd.optusnet.com.au [211.29.132.59]) by mx1.freebsd.org (Postfix) with ESMTP id 10C9E1DFF for ; Tue, 17 May 2016 10:26:42 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from c122-106-149-109.carlnfd1.nsw.optusnet.com.au (c122-106-149-109.carlnfd1.nsw.optusnet.com.au [122.106.149.109]) by mail108.syd.optusnet.com.au (Postfix) with ESMTPS id A332D1A3E13; Tue, 17 May 2016 20:26:33 +1000 (AEST) Date: Tue, 17 May 2016 20:26:26 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Konstantin Belousov cc: fs@freebsd.org Subject: Re: quick fix for slow directory shrinking in ffs In-Reply-To: <20160517082050.GX89104@kib.kiev.ua> Message-ID: <20160517192933.U4573@besplex.bde.org> References: <20160517072705.F2157@besplex.bde.org> <20160517082050.GX89104@kib.kiev.ua> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.1 cv=TuMb/2jh c=1 sm=1 tr=0 a=R/f3m204ZbWUO/0rwPSMPw==:117 a=L9H7d07YOLsA:10 a=9cW_t1CCXrUA:10 a=s5jvgZ67dGcA:10 a=kj9zAlcOel0A:10 a=DFMq5MnWpUiX_LZUeQ4A:9 a=l0SOHounc31-8c1A:21 a=KhYsnkSFTUpZV8AS:21 a=CjuIK1q_8ugA:10 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 May 2016 10:26:43 -0000 On Tue, 17 May 2016, Konstantin Belousov wrote: > On Tue, May 17, 2016 at 07:54:27AM +1000, Bruce Evans wrote: >> ffs does very slow shrinking of directories after removing some files >> leaves unused blocks at the end, by always doing synchronous truncation. >> ... >> X Index: ufs_lookup.c >> X =================================================================== >> X --- ufs_lookup.c (revision 299263) >> X +++ ufs_lookup.c (working copy) >> X @@ -1131,9 +1131,9 @@ >> X if (tvp != NULL) >> X VOP_UNLOCK(tvp, 0); >> X error = UFS_TRUNCATE(dvp, (off_t)dp->i_endoff, >> X - IO_NORMAL | IO_SYNC, cr); >> X + IO_NORMAL | (DOINGASYNC(dvp) ? 0 : IO_SYNC), cr); >> X if (error != 0) >> X - vprint("ufs_direnter: failted to truncate", dvp); >> X + vprint("ufs_direnter: failed to truncate", dvp); >> X #ifdef UFS_DIRHASH >> X if (error == 0 && dp->i_dirhash != NULL) >> X ufsdirhash_dirtrunc(dp, dp->i_endoff); > > The IO_SYNC flag, for non-journaled SU and any kind of non-SU mounts, > only affects the new blocks allocation mode, and write-out mode for > the last fragment. The truncation itself (for -J) is performed in the > context of the truncating thread. 
> The cg blocks, after the bits are
> set to free, are marked for delayed write (with the background write
> hack). The inode block is written according to the mount mode, ignoring
> IO_SYNC.

I don't see why you think that.  ffs_truncate() clearly honors IO_SYNC,
and testing shows that ffs with soft updates does precisely 7 extra
sync writes for directory compaction (where some of the 7 are probably
to sync previous activity).

I think it would be wrong to ignore IO_SYNC and use the mount mode for
inodes.  Async mounts still have that bug IIRC (I fixed it locally long
ago).  IO_SYNC is set if the file is open with O_SYNC and the mount
mode must not override this.  I think ffs has no way of telling that
this particular IO_SYNC is not associated with O_SYNC.

> That is, for always fully populated directory files, I do not see how
> anything is changed by the patch.

This problem affects all 512-boundaries, which are rarely block or even
fragment boundaries.

Test program:

X mp=$(df . | grep -v Filesystem | sed 's/ .*//')
X echo $mp
X while :;
X do
X 	echo -n "start: $(/usr/bin/stat -f %5z .); "
X 	mount -v | grep $mp | sed -e 's/.*writes/writes/' -e 's/, reads.*//'
X
X 	touch $(jot 41 0)	# just over 512 bytes
X 	echo -n "touch: $(/usr/bin/stat -f %5z .); "
X 	mount -v | grep $mp | sed -e 's/.*writes/writes/' -e 's/, reads.*//'
X
X 	#
X 	# Async mounts are still broken in -current -- these rm's (but nothing
X 	# else here) cause sync writes (but just 1 for the 2 rm's).
X 	#
X 	# rm 39 40	# just under, but no truncation yet
X 	echo -n "rm: $(/usr/bin/stat -f %5z .); "
X 	mount -v | grep $mp | sed -e 's/.*writes/writes/' -e 's/, reads.*//'
X
X 	#
X 	# Another bug in async mounts makes the truncate for the compaction
X 	# triggered by this touch do an async write (with the fix to stop it
X 	# doing a sync write).
X 	#
X 	touch 39	# still under; this creation does the truncation
X 	echo -n "touch 39:$(/usr/bin/stat -f %5z .); "
X 	mount -v | grep $mp | sed -e 's/.*writes/writes/' -e 's/, reads.*//'
X
X 	sleep 10
X 	echo
X done

I hope this uses a portable enough way to find the mount point.  This
must be run in an empty directory (or you have to adjust the sizes).

Results:
- async mount with fix: 1 sync write per iteration.  A bogus one
  triggered by the rm.  I only fixed this locally.  Remove the rm line
  so that the size stays slightly above 1024 bytes and there are 0 sync
  writes.  There is also 1 async write triggered by the truncate.  This
  is another bug in async mounts which I have fixed locally.  All
  writes for async mounts should be delayed unless IO_SYNC forces them
  to be sync.
- soft updates: 7 sync writes per iteration, all triggered by the final
  touch (which triggers the compaction).  Remove the rm line and there
  are again 0 sync writes.  Sometimes there are 2-5 async writes
  between the loop iterations.  These might be for the loop too, since
  there are more of them than for async mounts.  (I left daemons
  running while testing this on the root file system.  Test on a
  completely idle fs to be sure.)
- no soft updates and no async mount: first touch does 3 sync writes,
  rm does 2 sync, last touch does 4 sync; 0 async writes.

The IO_SYNC for soft updates apparently turns all the previous writes
for the loop into sync ones.  It has to order them and wait for them
and there is no better way to wait than a sync write.  The ordering
makes an unnecessary sync write even more expensive for soft updates
than for other cases.

Some relevant code in ffs_truncate:

Y 	/*
Y 	 * Shorten the size of the file. If the file is not being
Y 	 * truncated to a block boundary, the contents of the
Y 	 * partial block following the end of the file must be
Y 	 * zero'ed in case it ever becomes accessible again because
Y 	 * of subsequent file growth. Directories however are not
Y 	 * zero'ed as they should grow back initialized to empty.
Y 	 */
Y 	offset = blkoff(fs, length);
Y 	if (offset == 0) {
Y 		ip->i_size = length;
Y 		DIP_SET(ip, i_size, length);
Y 	} else {
Y 		lbn = lblkno(fs, length);
Y 		flags |= BA_CLRBUF;
Y 		error = UFS_BALLOC(vp, length - 1, 1, cred, flags, &bp);
Y 		if (error) {
Y 			return (error);
Y 		}
Y 		/*
Y 		 * When we are doing soft updates and the UFS_BALLOC
Y 		 * above fills in a direct block hole with a full sized
Y 		 * block that will be truncated down to a fragment below,
Y 		 * we must flush out the block dependency with an FSYNC
Y 		 * so that we do not get a soft updates inconsistency
Y 		 * when we create the fragment below.
Y 		 */
Y 		if (DOINGSOFTDEP(vp) && lbn < NDADDR &&
Y 		    fragroundup(fs, blkoff(fs, length)) < fs->fs_bsize &&
Y 		    (error = ffs_syncvnode(vp, MNT_WAIT)) != 0)
Y 			return (error);
Y 		ip->i_size = length;
Y 		DIP_SET(ip, i_size, length);
Y 		size = blksize(fs, ip, lbn);
Y 		if (vp->v_type != VDIR)
Y 			bzero((char *)bp->b_data + offset,
Y 			    (u_int)(size - offset));
Y 		/* Kirk's code has reallocbuf(bp, size, 1) here */
Y 		allocbuf(bp, size);
Y 		if (bp->b_bufsize == fs->fs_bsize)
Y 			bp->b_flags |= B_CLUSTEROK;
Y 		if (flags & IO_SYNC)
Y 			bwrite(bp);
Y 		else
Y 			bawrite(bp);
Y 	}

I think we usually arrive here and honor the IO_SYNC flag.  This is
correct.  Otherwise, we always do an async write, but that is wrong for
async mounts.  Here is my old fix for this:

Z diff -u2 ffs_inode.c~ ffs_inode.c
Z --- ffs_inode.c~	Wed Apr  7 21:22:26 2004
Z +++ ffs_inode.c	Sat Mar 23 01:23:16 2013
Z @@ -345,4 +431,6 @@
Z  	if (flags & IO_SYNC)
Z  		bwrite(bp);
Z +	else if (DOINGASYNC(ovp))
Z +		bdwrite(bp);
Z  	else
Z  		bawrite(bp);

This fix must be sprinkled in most places where there is a
bwrite()/bawrite() decision.
Bruce

From owner-freebsd-fs@freebsd.org Tue May 17 10:30:16 2016
Return-Path:
Delivered-To: freebsd-fs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 83E92B3F0E9 for ; Tue, 17 May 2016 10:30:16 +0000 (UTC) (envelope-from ben.rubson@gmail.com)
Received: from mail-wm0-x233.google.com (mail-wm0-x233.google.com [IPv6:2a00:1450:400c:c09::233]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 2C93B101F for ; Tue, 17 May 2016 10:30:16 +0000 (UTC) (envelope-from ben.rubson@gmail.com)
Received: by mail-wm0-x233.google.com with SMTP id a17so22804850wme.0 for ; Tue, 17 May 2016 03:30:16 -0700 (PDT)
X-Received: by 10.194.72.103 with SMTP id c7mr646622wjv.65.1463481014764; Tue, 17 May 2016 03:30:14 -0700 (PDT)
Received: from [192.168.1.16] (210.236.26.109.rev.sfr.net. [109.26.236.210]) by smtp.gmail.com with ESMTPSA id lr9sm2297152wjb.39.2016.05.17.03.30.13 (version=TLS1 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Tue, 17 May 2016 03:30:14 -0700 (PDT)
Content-Type: text/plain; charset=us-ascii
Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\))
Subject: Re: Best practice for high availability ZFS pool
From: Ben RUBSON
In-Reply-To:
Date: Tue, 17 May 2016 12:30:13 +0200
Content-Transfer-Encoding: quoted-printable
Message-Id:
References: <5E69742D-D2E0-437F-B4A9-A71508C370F9@FreeBSD.org>
To: freebsd-fs@freebsd.org
X-Mailer: Apple Mail (2.3124)
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: Filesystems
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Tue, 17 May 2016 10:30:16 -0000

> On 17 may 2016 at 03:43, Bob Friesenhahn wrote:
>
> On Mon, 16 May 2016, Palle Girgensohn wrote:
>>
>> Shared storage still has a single point of failure, the JBOD box.
>> Apart from that, is there even any support for the kind of storage
>> PCI cards that support dual head for a storage box? I cannot find any.
>
> Use two (or three) JBOD boxes and do simple zfs mirroring across them
> so you can unplug a JBOD and the pool still works. Or use a bunch of
> JBOD boxes and use zfs raidz2 (or raidz3) across them with careful LUN
> selection so there is total storage redundancy and you can unplug a
> JBOD and the pool still works.
>
> Fiber channel (or FCoE) or iSCSI allows putting the hardware at some
> distance.
>
> Without completely isolated systems there is always the risk of total
> failure. Even with zfs send there is the risk of total failure if the
> sent data results in corruption on the receiving side.

In this case rollback one of the previous snapshots on the receiving
side ?
Did you mean the sent data can totally break the receiving pool, making
it unusable / unable to import ? Did we already see this ?

Thank you,

Ben

From owner-freebsd-fs@freebsd.org Tue May 17 10:45:01 2016
Return-Path:
Delivered-To: freebsd-fs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id A80CCB3F3D8 for ; Tue, 17 May 2016 10:45:01 +0000 (UTC) (envelope-from ronald-lists@klop.ws)
Received: from smarthost1.greenhost.nl (smarthost1.greenhost.nl [195.190.28.81]) (using TLSv1 with cipher AES256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 6D0321763 for ; Tue, 17 May 2016 10:45:00 +0000 (UTC) (envelope-from ronald-lists@klop.ws)
Received: from smtp.greenhost.nl ([213.108.104.138]) by smarthost1.greenhost.nl with esmtps (TLS1.0:DHE_RSA_AES_128_CBC_SHA1:16) (Exim 4.72) (envelope-from ) id 1b2cUd-0001G6-Ak; Tue, 17 May 2016 12:44:51 +0200
Content-Type: text/plain; charset=utf-8; format=flowed; delsp=yes
To: "FreeBSD Filesystems", "Rainer Duffner"
Subject: Re: zfs receive stalls whole system
References: <0C2233A9-C64A-4773-ABA5-C0BCA0D037F0@ultra-secure.de>
Date: Tue, 17 May 2016 12:44:50 +0200
MIME-Version: 1.0
Content-Transfer-Encoding: Quoted-Printable
From: "Ronald Klop"
Message-ID:
In-Reply-To: <0C2233A9-C64A-4773-ABA5-C0BCA0D037F0@ultra-secure.de>
User-Agent: Opera Mail/1.0 (Win32)
X-Virus-Scanned: by clamav at smarthost1.samage.net
X-Spam-Level: /
X-Spam-Score: -0.2
X-Spam-Status: No, score=-0.2 required=5.0 tests=ALL_TRUSTED, BAYES_50 autolearn=disabled version=3.4.0
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: Filesystems
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Tue, 17 May 2016 10:45:01 -0000

On Tue, 17 May 2016 01:07:24 +0200, Rainer Duffner wrote:

> Hi,
>
> I have two servers, that were running FreeBSD 10.1-AMD64 for a long
> time, one zfs-sending to the other (via zxfer). Both are NFS-servers
> and MySQL-slaves, the sender is actively used as NFS-server, the
> recipient is just a warm-standby, in case something serious happens
> and we don't want to wait for a day until the restore is back in
> place. The MySQL-Slaves are actively used as read-only servers (at the
> application level, Python's SQL-Alchemy does that, apparently).
>
> They are HP DL380G8 (one CPU, hexacore) with over 128 GB RAM (I think
> one has 144, the other has 192).
> While they were running 10.1, they used HP P420 RAID-controllers with
> individual 12 RAID0 volumes that I pooled into 6-disk RAIDZ2 vdevs.
> I use zfsnap to do hourly, daily and weekly snapshots.
>
> Sending worked well, especially after updating to 10.1
>
> Because the storage was over 90% full (and I really hate this
> RAID0-business we have with the HP RAID controllers), I rebuilt the
> servers with HPs OEMed H220/221 controllers (LSI 2308 in disguise) and
> an external disk shelf, hosting 12 additional disks, was added - and I
> upgraded to FreeBSD 10.3.
> Because we didn't want to throw out the original disks, but increase
> available space a lot, the new disks are double the size of the
> original disks (600 vs. 1200 GB SAS).
> I also created GPT-partitions on the disks and labeled them according
> to the disk's position in the cages/shelf, and created the pools with
> the gpt-partition-names instead of the daX-names.
>
> Now, when I do a zxfer, sometimes the whole system stalls while the
> data is sent over, especially if the delta is large or if something
> else is reading from the disk at the same time (backup agent).
>
> I had this before, on 10.0 (I believe, we didn't have this in 9.1
> either, IIRC) and it went away in 10.1.
>
> It's very difficult (well, impossible) to debug, because the system
> totally hangs and doesn't accept any keypresses.
>
> Would a ZIL help in this case?
> I always thought that NFS was the only thing that did SYNC writes...

Databases love SYNC writes too. (But that doesn't say anything about
the unresponsive system.)
I think there is a statistic somewhere in FreeBSD to analyze the sync
vs async writes and decide if a ZIL will help or not. (But that doesn't
say anything about the unresponsive system either.)

Ronald.

From owner-freebsd-fs@freebsd.org Tue May 17 10:47:00 2016
Return-Path:
Delivered-To: freebsd-fs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 78943B3F475 for ; Tue, 17 May 2016 10:47:00 +0000 (UTC) (envelope-from ronald-lists@klop.ws)
Received: from smarthost1.greenhost.nl (smarthost1.greenhost.nl [195.190.28.81]) (using TLSv1 with cipher AES256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 408231979 for ; Tue, 17 May 2016 10:47:00 +0000 (UTC) (envelope-from ronald-lists@klop.ws)
Received: from smtp.greenhost.nl ([213.108.104.138]) by smarthost1.greenhost.nl with esmtps (TLS1.0:DHE_RSA_AES_128_CBC_SHA1:16) (Exim 4.72) (envelope-from ) id 1b2cWg-0001vj-1L; Tue, 17 May 2016 12:46:58 +0200
Content-Type: text/plain; charset=utf-8; format=flowed; delsp=yes
To: "FreeBSD Filesystems", "Rainer Duffner"
Subject: Re: zfs receive stalls whole system
References: <0C2233A9-C64A-4773-ABA5-C0BCA0D037F0@ultra-secure.de>
Date: Tue, 17 May 2016 12:46:56 +0200
MIME-Version: 1.0
Content-Transfer-Encoding: Quoted-Printable
From: "Ronald Klop"
Message-ID:
In-Reply-To:
User-Agent: Opera Mail/1.0 (Win32)
X-Virus-Scanned: by clamav at smarthost1.samage.net
X-Spam-Level: /
X-Spam-Score: -0.2
X-Spam-Status: No, score=-0.2 required=5.0 tests=ALL_TRUSTED, BAYES_50 autolearn=disabled version=3.4.0
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: Filesystems
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Tue, 17 May 2016 10:47:00 -0000

On Tue, 17 May 2016 12:44:50 +0200, Ronald Klop wrote:

> On Tue, 17 May 2016 01:07:24 +0200, Rainer Duffner wrote:
> [...]
>> Would a ZIL help in this case?
>> I always thought that NFS was the only thing that did SYNC writes...
>
> Databases love SYNC writes too. (But that doesn't say anything about
> the unresponsive system.)
> I think there is a statistic somewhere in FreeBSD to analyze the sync
> vs async writes and decide if a ZIL will help or not. (But that
> doesn't say anything about the unresponsive system either.)
>
> Ronald.

One question. You did not enable dedup(lication)?

Ronald.
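P.S. A quick way to check, with a placeholder pool name:

  zpool list -o name,dedupratio tank      # 1.00x if dedup was never used
  zfs get -rH dedup tank | grep -vw off   # should print nothing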
From owner-freebsd-fs@freebsd.org Tue May 17 10:48:45 2016
Return-Path:
Delivered-To: freebsd-fs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 9F72EB3F514 for ; Tue, 17 May 2016 10:48:45 +0000 (UTC) (envelope-from freebsd-listen@fabiankeil.de)
Received: from smtprelay04.ispgateway.de (smtprelay04.ispgateway.de [80.67.18.16]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 65ABB1A3E for ; Tue, 17 May 2016 10:48:44 +0000 (UTC) (envelope-from freebsd-listen@fabiankeil.de)
Received: from [78.35.176.77] (helo=fabiankeil.de) by smtprelay04.ispgateway.de with esmtpsa (TLSv1.2:AES128-GCM-SHA256:128) (Exim 4.84) (envelope-from ) id 1b2cUs-0000f2-Vw for freebsd-fs@freebsd.org; Tue, 17 May 2016 12:45:07 +0200
Date: Tue, 17 May 2016 12:36:27 +0200
From: Fabian Keil
To: FreeBSD Filesystems
Subject: Re: zfs receive stalls whole system
Message-ID: <20160517123627.699e2aa5@fabiankeil.de>
In-Reply-To:
References: <0C2233A9-C64A-4773-ABA5-C0BCA0D037F0@ultra-secure.de> <20160517102757.135c1468@fabiankeil.de>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha1; boundary="Sig_/Fulk3QySoETDWNL8l4bPeY/"; protocol="application/pgp-signature"
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: Filesystems
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Tue, 17 May 2016 10:48:45 -0000

rainer@ultra-secure.de wrote:

> On 2016-05-17 10:27, Fabian Keil wrote:
> > Rainer Duffner wrote:
> > [...]
> >> Now, when I do a zxfer, sometimes the whole system stalls while
> >> the data is sent over, especially if the delta is large or if
> >> something else is reading from the disk at the same time (backup
> >> agent).
> >>
> >> I had this before, on 10.0 (I believe, we didn't have this in 9.1
> >> either, IIRC) and it went away in 10.1.
> >
> > Do you use geli for swap device(s)?
>
> Yes, I do.
> /dev/mirror/swap.eli    none    swap    sw    0    0
>
> Bad idea?

It can cause deadlocks and poor performance when paging.

This was recently fixed in ElectroBSD and I intend to submit
the patch in a couple of days after a bit more stress testing.
The patch is already available at:
https://www.fabiankeil.de/sourcecode/electrobsd/GELI-Use-a-dedicated-uma-zone-for-writes.diff

Fabian

From owner-freebsd-fs@freebsd.org Tue May 17 11:17:25 2016
Return-Path:
Delivered-To: freebsd-fs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 12609B3FDD9 for ; Tue, 17 May 2016 11:17:25 +0000 (UTC) (envelope-from kostikbel@gmail.com)
Received: from mailman.ysv.freebsd.org (unknown [127.0.1.3]) by mx1.freebsd.org (Postfix) with ESMTP id F067B1E4B for ; Tue, 17 May 2016 11:17:24 +0000 (UTC) (envelope-from kostikbel@gmail.com)
Received: by mailman.ysv.freebsd.org (Postfix) id EFC38B3FDD8; Tue, 17 May 2016 11:17:24 +0000 (UTC)
Delivered-To: fs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id EF68AB3FDD7 for ; Tue, 17 May 2016 11:17:24 +0000 (UTC) (envelope-from kostikbel@gmail.com)
Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 7EC451E4A for ; Tue, 17 May 2016 11:17:24 +0000 (UTC) (envelope-from kostikbel@gmail.com)
Received: from tom.home (kib@localhost [127.0.0.1]) by kib.kiev.ua (8.15.2/8.15.2) with ESMTPS id u4HBHFQo055669 (version=TLSv1 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO); Tue, 17 May 2016 14:17:15 +0300 (EEST) (envelope-from kostikbel@gmail.com)
Received: (from kostik@localhost) by tom.home (8.15.2/8.15.2/Submit) id u4HBHFYF055668; Tue, 17 May 2016 14:17:15 +0300 (EEST) (envelope-from kostikbel@gmail.com)
X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f
Date: Tue, 17 May 2016 14:17:15 +0300
From: Konstantin Belousov
To: Bruce Evans
Cc: fs@freebsd.org
Subject: Re: quick fix for slow directory shrinking in ffs
Message-ID: <20160517111715.GC89104@kib.kiev.ua>
References: <20160517072705.F2157@besplex.bde.org> <20160517082050.GX89104@kib.kiev.ua> <20160517192933.U4573@besplex.bde.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20160517192933.U4573@besplex.bde.org>
User-Agent: Mutt/1.6.1 (2016-04-27)
X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no autolearn_force=no version=3.4.1
X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on tom.home
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: Filesystems
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Tue, 17 May 2016 11:17:25 -0000

On Tue, May 17, 2016 at 08:26:26PM +1000, Bruce Evans wrote:
> On Tue, 17 May 2016, Konstantin Belousov wrote:
>
> > On Tue, May 17, 2016 at 07:54:27AM +1000, Bruce Evans wrote:
> >> ffs does very slow shrinking of directories after removing some files
> >> leaves unused blocks at the end, by always doing synchronous
truncation. > >> ... > >> X Index: ufs_lookup.c > >> X =================================================================== > >> X --- ufs_lookup.c (revision 299263) > >> X +++ ufs_lookup.c (working copy) > >> X @@ -1131,9 +1131,9 @@ > >> X if (tvp != NULL) > >> X VOP_UNLOCK(tvp, 0); > >> X error = UFS_TRUNCATE(dvp, (off_t)dp->i_endoff, > >> X - IO_NORMAL | IO_SYNC, cr); > >> X + IO_NORMAL | (DOINGASYNC(dvp) ? 0 : IO_SYNC), cr); > >> X if (error != 0) > >> X - vprint("ufs_direnter: failted to truncate", dvp); > >> X + vprint("ufs_direnter: failed to truncate", dvp); > >> X #ifdef UFS_DIRHASH > >> X if (error == 0 && dp->i_dirhash != NULL) > >> X ufsdirhash_dirtrunc(dp, dp->i_endoff); > > > > The IO_SYNC flag, for non-journaled SU and any kind of non-SU mounts, > > only affects the new blocks allocation mode, and write-out mode for > > the last fragment. The truncation itself (for -J) is performed in the > > context of the truncating thread. The cg blocks, after the bits are > > set to free, are marked for delayed write (with the background write > > hack). The inode block is written according to the mount mode, ignoring > > IO_SYNC. > > I don't see why you think that. ffs_truncate() clearly honors IO_SYNC, > and testing shows that ffs with soft updates does precisely 7 extra > sync writes for directory compaction (where some of the 7 are probably > to sync previous activity). ffs_truncate() completely syncs the vnode for non-J truncations. I enumerated bits which are written according to the flags, and it seems to be aligned with what you wrote below. > > I think it would be wrong to ignore IO_SYNC and use the mount mode for > inodes. Async mounts still have that bug IIRC (I fixed locally long > ago). IO_SYNC is set if the file is is open with O_SYNC and mount mode > must not override this. I think ffs has no way of telling that this > particular IO_SYNC is not associated with O_SYNC. > > > That is, for always fully populated directory files, I do not see how > > anything is changed by the patch. > > This problem affects all 512-boundaries, which are rarely block or > even fragment boundaries. Yes, the write-outs of the blocks or fragments at the new end of the file are not needed if the buffer is clear and not newly allocated. But they are performed unconditionally. > > The IO_SYNC for soft updates apparently turns all the previous writes for > the loop into sync ones. It has to order them and wait for them and there > is no better way to wait than a sync write. The ordering makes an > unnecessary sync write even more expensive for soft updates than for > other cases. > > Some relevant code in ffs_truncate: > > Y /* > Y * Shorten the size of the file. If the file is not being > Y * truncated to a block boundary, the contents of the > Y * partial block following the end of the file must be > Y * zero'ed in case it ever becomes accessible again because > Y * of subsequent file growth. Directories however are not > Y * zero'ed as they should grow back initialized to empty. 
> Y */ > Y offset = blkoff(fs, length); > Y if (offset == 0) { > Y ip->i_size = length; > Y DIP_SET(ip, i_size, length); > Y } else { > Y lbn = lblkno(fs, length); > Y flags |= BA_CLRBUF; > Y error = UFS_BALLOC(vp, length - 1, 1, cred, flags, &bp); > Y if (error) { > Y return (error); > Y } > Y /* > Y * When we are doing soft updates and the UFS_BALLOC > Y * above fills in a direct block hole with a full sized > Y * block that will be truncated down to a fragment below, > Y * we must flush out the block dependency with an FSYNC > Y * so that we do not get a soft updates inconsistency > Y * when we create the fragment below. > Y */ > Y if (DOINGSOFTDEP(vp) && lbn < NDADDR && > Y fragroundup(fs, blkoff(fs, length)) < fs->fs_bsize && > Y (error = ffs_syncvnode(vp, MNT_WAIT)) != 0) > Y return (error); > Y ip->i_size = length; > Y DIP_SET(ip, i_size, length); > Y size = blksize(fs, ip, lbn); > Y if (vp->v_type != VDIR) > Y bzero((char *)bp->b_data + offset, > Y (u_int)(size - offset)); > Y /* Kirk's code has reallocbuf(bp, size, 1) here */ > Y allocbuf(bp, size); > Y if (bp->b_bufsize == fs->fs_bsize) > Y bp->b_flags |= B_CLUSTEROK; > Y if (flags & IO_SYNC) > Y bwrite(bp); > Y else > Y bawrite(bp); > Y } > > I think we usually arrive here and honor the IO_SYNC flag. This is correct. > Otherwise, we always do an async write, but that is wrong for async mounts. > Here is my old fix for this: > > Z diff -u2 ffs_inode.c~ ffs_inode.c > Z --- ffs_inode.c~ Wed Apr 7 21:22:26 2004 > Z +++ ffs_inode.c Sat Mar 23 01:23:16 2013 > Z @@ -345,4 +431,6 @@ > Z if (flags & IO_SYNC) > Z bwrite(bp); > Z + else if (DOINGASYNC(ovp)) > Z + bdwrite(bp); > Z else > Z bawrite(bp); > > This fix must be sprinkled in most places where there is a bwrite()/ > bawrite() decision. No, I do not think that it would be correct for SU mounts. It is essential for the correct operation of e.g. ffs_indirtrunc() that writes for SU case are synchronous, since no dependencies on the indirect block updates are recorded. The fact that syncvnode() is done before is similarly important, because no existing dependencies are cleared. On the other hand, I agree with the note that the final ffs_update() must honour IO_SYNC requests. Anyway, my point was that your patch does not change the hardest source of sync writes, only the write of the final block. I will commit the following. diff --git a/sys/ufs/ffs/ffs_inode.c b/sys/ufs/ffs/ffs_inode.c index 0202820..50b456b 100644 --- a/sys/ufs/ffs/ffs_inode.c +++ b/sys/ufs/ffs/ffs_inode.c @@ -610,7 +610,7 @@ extclean: softdep_journal_freeblocks(ip, cred, length, IO_EXT); else softdep_setup_freeblocks(ip, length, IO_EXT); - return (ffs_update(vp, !DOINGASYNC(vp))); + return (ffs_update(vp, (flags & IO_SYNC) != 0 || !DOINGASYNC(vp))); } /* diff --git a/sys/ufs/ufs/ufs_lookup.c b/sys/ufs/ufs/ufs_lookup.c index 43b4e5c..53536ff 100644 --- a/sys/ufs/ufs/ufs_lookup.c +++ b/sys/ufs/ufs/ufs_lookup.c @@ -1131,7 +1131,7 @@ ufs_direnter(dvp, tvp, dirp, cnp, newdirbp, isrename) if (tvp != NULL) VOP_UNLOCK(tvp, 0); error = UFS_TRUNCATE(dvp, (off_t)dp->i_endoff, - IO_NORMAL | IO_SYNC, cr); + IO_NORMAL | (DOINGASYNC(dvp) ? 
0 : IO_SYNC), cr); if (error != 0) vprint("ufs_direnter: failed to truncate", dvp); #ifdef UFS_DIRHASH From owner-freebsd-fs@freebsd.org Tue May 17 11:59:58 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 6E06FB3EE6C for ; Tue, 17 May 2016 11:59:58 +0000 (UTC) (envelope-from rainer@ultra-secure.de) Received: from connect.ultra-secure.de (connect.ultra-secure.de [88.198.71.201]) by mx1.freebsd.org (Postfix) with ESMTP id CB5511BB2 for ; Tue, 17 May 2016 11:59:57 +0000 (UTC) (envelope-from rainer@ultra-secure.de) Received: (Haraka outbound); Tue, 17 May 2016 13:59:55 +0200 Authentication-Results: connect.ultra-secure.de; auth=pass (login); spf=none smtp.mailfrom=ultra-secure.de Received-SPF: None (connect.ultra-secure.de: domain of ultra-secure.de does not designate 127.0.0.16 as permitted sender) receiver=connect.ultra-secure.de; identity=mailfrom; client-ip=127.0.0.16; helo=connect.ultra-secure.de; envelope-from= Received: from connect.ultra-secure.de (expwebmail [127.0.0.16]) by connect.ultra-secure.de (Haraka/2.6.2-toaster) with ESMTPSA id 9E04EF33-5B1F-42B7-8D97-D227B86C463D.1 envelope-from (authenticated bits=0) (version=TLSv1/SSLv3 cipher=AES128-GCM-SHA256 verify=NO); Tue, 17 May 2016 13:59:53 +0200 MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII; format=flowed Content-Transfer-Encoding: 7bit Date: Tue, 17 May 2016 13:59:53 +0200 From: rainer@ultra-secure.de To: Ronald Klop Cc: FreeBSD Filesystems Subject: Re: zfs receive stalls whole system In-Reply-To: References: <0C2233A9-C64A-4773-ABA5-C0BCA0D037F0@ultra-secure.de> Message-ID: X-Sender: rainer@ultra-secure.de User-Agent: Roundcube Webmail/1.1.4 X-Haraka-GeoIP: --, , NaNkm X-Haraka-GeoIP-Received: X-Haraka-p0f: os="undefined undefined" link_type="undefined" distance=undefined total_conn=undefined shared_ip=Y X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on spamassassin X-Spam-Level: X-Spam-Status: No, score=-2.9 required=5.0 tests=ALL_TRUSTED,BAYES_00, URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.1 X-Haraka-Karma: score: 6, good: 43, bad: 0, connections: 58, history: 43, pass:all_good, relaying X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 May 2016 11:59:58 -0000 Am 2016-05-17 12:46, schrieb Ronald Klop: > On Tue, 17 May 2016 12:44:50 +0200, Ronald Klop > wrote: > One question. You did not enable dedup(lication)? No, certainly not. It's off on all the filesystems. I was sometimes toying with the idea of enabling it, because the dataset is structured in a way where it might actually benefit from dedup. But I didn't go through with it. 
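I guess I could estimate it first without turning it on - zdb can
simulate dedup against an existing pool (pool name below is a
placeholder; this can run for a long time and use a lot of RAM):

  zdb -S tank
  # ends with an estimated "dedup = N.NN" ratio; much below ~2x,
  # the DDT's memory cost usually outweighs the savings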
Thanks
Rainer

From owner-freebsd-fs@freebsd.org Tue May 17 12:08:57 2016
Return-Path:
Delivered-To: freebsd-fs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 9B074B3F73D for ; Tue, 17 May 2016 12:08:57 +0000 (UTC) (envelope-from lexa@lexa.ru)
Received: from mx3.lexa.ru (ns503534.ip-198-27-68.net [198.27.68.102]) by mx1.freebsd.org (Postfix) with ESMTP id 808251480 for ; Tue, 17 May 2016 12:08:56 +0000 (UTC) (envelope-from lexa@lexa.ru)
Received: by mx3.lexa.ru (Postfix, from userid 66) id C8433224A5E; Tue, 17 May 2016 08:00:07 -0400 (EDT)
Received: from [193.124.130.166] (unknown [193.124.130.166]) by home-gw.lexa.ru (Postfix) with ESMTP id 6F73616DE for ; Tue, 17 May 2016 15:00:03 +0300 (MSK)
To: freebsd-fs@freebsd.org
From: Alex Tutubalin
Subject: ZFS performance bottlenecks: CPU or RAM or anything else?
Message-ID: <8441f4c0-f8d1-f540-b928-7ae60998ba8e@lexa.ru>
Date: Tue, 17 May 2016 15:00:03 +0300
User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:45.0) Gecko/20100101 Thunderbird/45.1.0
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: Filesystems
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Tue, 17 May 2016 12:08:57 -0000

Hi,

I'm new to the list; sorry if the subject has been discussed before
(many times) - just point me to the archives.

I'm building a new storage server for linear-read/linear-write
performance with a limited number of parallel data streams (a load like
reading/writing multi-gigabyte Photoshop files, or reading many large
raw photo files).
The target is to saturate a 10G link using SMB or iSCSI.

Several years ago I tested a small zpool (5x 3TB 7200rpm drives in
RAIDZ) with different CPU/memory combos and got these results for
linear write speed with big chunks:

440 MB/sec with Core i3-2120/DDR3-1600 RAM (2 channel)
360 MB/sec with Core i7-920/DDR3-1333 (3 channel RAM)
280 MB/sec with Core 2Q Q9300/DDR2-800 (2 channel)

Mixed thoughts: the i7-920 is the fastest of the three, and its linear
RAM access is also the fastest, yet it is beaten by the i3-2120 with
lower-latency memory.

Also, I've found this link:
https://calomel.org/zfs_raid_speed_capacity.html
For 6x SSD and 10x SSD in RAIDZ2, read speed is very similar (1.7
GB/sec) and write speed is very close (721/806 MB/sec for 6/10 drives).

Assuming HBA/PCIe performance is much the same for read and write
operations, write speed is not limited by the HBA/bus... so what is it
limited by? CPU or RAM or ...?

So, my question is: what CPU/memory is optimal for ZFS performance?

In particular:
- DDR3 or DDR4 (twice the bandwidth)?
- a limited number of cores at a high clock rate (e.g. i3-6xxx), or
many cores at a slower clock?

No plans to use compression or deduplication, only raidz2 with 8-10 HDD
spindles and 3-4-5 SSDs for L2ARC.
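A crude way to measure the pool-side sequential limits first, with the
network out of the picture (paths and sizes are placeholders; with
compression off, zeroes are usable test data):

  # write ~32 GiB so the ARC cannot hide the disks
  dd if=/dev/zero of=/tank/bench/bigfile bs=1m count=32768
  # read it back (export/import the pool or reboot first to empty the ARC)
  dd if=/tank/bench/bigfile of=/dev/null bs=1m
  # watch per-disk utilization meanwhile to see what saturates
  gstat -p

If the disks sit well below 100% busy while one core is pegged, a
faster CPU should help; if the disks are pegged, more or faster
spindles will.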
Alex Tutubalin lexa@lexa.ru From owner-freebsd-fs@freebsd.org Tue May 17 12:11:16 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 7E1E5B3F7A5 for ; Tue, 17 May 2016 12:11:16 +0000 (UTC) (envelope-from jg@internetx.com) Received: from mx1.internetx.com (mx1.internetx.com [62.116.129.39]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 3CDCB1773 for ; Tue, 17 May 2016 12:11:16 +0000 (UTC) (envelope-from jg@internetx.com) Received: from localhost (localhost [127.0.0.1]) by mx1.internetx.com (Postfix) with ESMTP id 4117445FC0D8; Tue, 17 May 2016 14:11:13 +0200 (CEST) X-Virus-Scanned: InterNetX GmbH amavisd-new at ix-mailer.internetx.de Received: from mx1.internetx.com ([62.116.129.39]) by localhost (ix-mailer.internetx.de [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id HSvJnI2+xBvQ; Tue, 17 May 2016 14:11:09 +0200 (CEST) Received: from [192.168.100.26] (pizza.internetx.de [62.116.129.3]) (using TLSv1 with cipher AES128-SHA (128/128 bits)) (No client certificate requested) by mx1.internetx.com (Postfix) with ESMTPSA id 49B5C45FC0D6; Tue, 17 May 2016 14:11:08 +0200 (CEST) Subject: Re: ZFS performance bottlenecks: CPU or RAM or anything else? References: <8441f4c0-f8d1-f540-b928-7ae60998ba8e@lexa.ru> To: Alex Tutubalin , freebsd-fs@freebsd.org Reply-To: jg@internetx.com From: InterNetX - Juergen Gotteswinter Message-ID: Date: Tue, 17 May 2016 14:11:07 +0200 User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:45.0) Gecko/20100101 Thunderbird/45.1.0 MIME-Version: 1.0 In-Reply-To: <8441f4c0-f8d1-f540-b928-7ae60998ba8e@lexa.ru> Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 May 2016 12:11:16 -0000 Raidz is your Problem, go for Mirrors Am 5/17/2016 um 2:00 PM schrieb Alex Tutubalin: > Hi, > > I'm new to the list, sorry if the subject was discussed earlier (for > many times), just point to archives.... > > I'm building new storage server for 'linear read/linear write' > performance with limited number of parallel data streams (load like > read/write multi-gigabyte photoshop files, or read many large raw photo > files). > Target is to saturate 10G link using SMB or iSCSI. > > Several years ago I've tested small zpool (5x3Tb 7200rpm drives in > RAIDZ) with different CPU/memory combos and have got these results for > linear write speed by big chunks: > > 440 Mb/sec with Core i3-2120/DDR3-1600 ram (2 channel) > 360 Mb/sec with core i7-920/DDR3-1333 (3 channel RAM) > 280 Mb/sec with Core 2Q Q9300 /DDR2-800 (2 channel) > > Mixed thoughts: i7-920 is fastest of the three, RAM linear access also > fastest, but beaten by i3-2120 with lower latency memory. > > Also, I've found this link: > https://calomel.org/zfs_raid_speed_capacity.html > For 6x SSD and 10x SSD in RAIDZ2, there is very similar read speed > (1.7Gb/sec) and very close in write speed (721/806 Mb/sec for 6/10 drives). > > Assuming HBA/PCIe performance to be very same for read and write > operations, write speed is not limited by HBA/bus... so it is limited by > what? CPU or RAM or ...? > > So, my question is 'what CPU/memory is optimal for ZFS performance'? 
>
> In particular:
> - DDR3 or DDR4 (twice the bandwidth) ?
> - limited number of cores and high clock rate (e.g. i3-6xxxx) or many
> cores/slower clock ?
>
> No plans to use compression or deduplication, only raidz2 with 8-10 HDD
> spindles and 3-4-5 SSDs for L2ARC.
>
> Alex Tutubalin
> lexa@lexa.ru
>
>
> _______________________________________________
> freebsd-fs@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-fs
> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"

From owner-freebsd-fs@freebsd.org Tue May 17 12:14:24 2016
Return-Path:
Delivered-To: freebsd-fs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 65FCCB3F8A9 for ; Tue, 17 May 2016 12:14:24 +0000 (UTC) (envelope-from maurizio.vairani@cloverinformatica.it)
Received: from host202-129-static.10-188-b.business.telecomitalia.it (host202-129-static.10-188-b.business.telecomitalia.it [188.10.129.202]) by mx1.freebsd.org (Postfix) with ESMTP id 21F391984; Tue, 17 May 2016 12:14:23 +0000 (UTC) (envelope-from maurizio.vairani@cloverinformatica.it)
Received: from [192.168.0.60] (MAURIZIO-PC [192.168.0.60]) by host202-129-static.10-188-b.business.telecomitalia.it (Postfix) with ESMTP id C35822C6EC; Tue, 17 May 2016 14:04:28 +0200 (CEST)
Subject: Re: Best practice for high availability ZFS pool
To: Palle Girgensohn, freebsd-fs@freebsd.org
References: <5E69742D-D2E0-437F-B4A9-A71508C370F9@FreeBSD.org>
From: Maurizio Vairani
Message-ID: <625e2776-a97f-9ee7-a1cb-c1a053804f6c@cloverinformatica.it>
Date: Tue, 17 May 2016 14:04:28 +0200
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:45.0) Gecko/20100101 Thunderbird/45.0
MIME-Version: 1.0
In-Reply-To: <5E69742D-D2E0-437F-B4A9-A71508C370F9@FreeBSD.org>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: Filesystems
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Tue, 17 May 2016 12:14:24 -0000

On 16/05/2016 12:08, Palle Girgensohn wrote:
> Hi,
>
> We need to set up a ZFS pool with redundancy. The main goal is high
> availability - uptime.
>
> I can see a few paths to follow.
>
> 1. HAST + ZFS
>
> 2. Some sort of shared storage, two machines sharing a JBOD box.
>
> 3. ZFS replication (zfs snapshot + zfs send | ssh | zfs receive)

Hi,
have you tried compression? Something like:

zfs snapshot + zfs send | lzop | ssh | lzop -d | zfs receive

I am successfully using this method, with a modified version of
sysutils/zrep, but my pools are only a few TB in size.
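Spelled out a bit more, with made-up pool/host/snapshot names (and -i
for the incremental case):

  SNAP="tank/data@$(date +%Y%m%d%H%M)"
  zfs snapshot "$SNAP"
  zfs send -i tank/data@previous "$SNAP" \
    | lzop \
    | ssh standby 'lzop -d | zfs receive -F backup/data'

lzop is archivers/lzop in ports; putting mbuffer between the stages
also helps smooth out the bursty send stream.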
--
Maurizio

From owner-freebsd-fs@freebsd.org Tue May 17 12:24:10 2016
Return-Path:
Delivered-To: freebsd-fs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 4D385B3FC11 for ; Tue, 17 May 2016 12:24:10 +0000 (UTC) (envelope-from lexa@lexa.ru)
Received: from mx3.lexa.ru (ns503534.ip-198-27-68.net [198.27.68.102]) by mx1.freebsd.org (Postfix) with ESMTP id 3193D1FF2 for ; Tue, 17 May 2016 12:24:09 +0000 (UTC) (envelope-from lexa@lexa.ru)
Received: by mx3.lexa.ru (Postfix, from userid 66) id 4750C224A66; Tue, 17 May 2016 08:24:09 -0400 (EDT)
Received: from [193.124.130.166] (unknown [193.124.130.166]) by home-gw.lexa.ru (Postfix) with ESMTP id DC9661790 for ; Tue, 17 May 2016 15:21:07 +0300 (MSK)
Subject: Re: ZFS performance bottlenecks: CPU or RAM or anything else?
To: freebsd-fs@freebsd.org
References: <8441f4c0-f8d1-f540-b928-7ae60998ba8e@lexa.ru>
From: Alex Tutubalin
Message-ID: <16e474da-6b20-2e51-9981-3c262eaff350@lexa.ru>
Date: Tue, 17 May 2016 15:21:08 +0300
User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:45.0) Gecko/20100101 Thunderbird/45.1.0
MIME-Version: 1.0
In-Reply-To:
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: Filesystems
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Tue, 17 May 2016 12:24:10 -0000

On 5/17/2016 3:11 PM, InterNetX - Juergen Gotteswinter wrote:
> Raidz is your Problem, go for Mirrors

Raidz2 will survive two (any) drives failure, while mirrored stripe
will not.

So, if it is possible to increase raidz2 performance by faster CPU or
RAM I'll go this route

Alex

From owner-freebsd-fs@freebsd.org Tue May 17 12:31:52 2016
Return-Path:
Delivered-To: freebsd-fs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 01DCDB3FD93 for ; Tue, 17 May 2016 12:31:52 +0000 (UTC) (envelope-from killing@multiplay.co.uk)
Received: from mail-wm0-x235.google.com (mail-wm0-x235.google.com [IPv6:2a00:1450:400c:c09::235]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 93BAE1289 for ; Tue, 17 May 2016 12:31:51 +0000 (UTC) (envelope-from killing@multiplay.co.uk)
Received: by mail-wm0-x235.google.com with SMTP id g17so28215353wme.1 for ; Tue, 17 May 2016 05:31:51 -0700 (PDT)
X-Received: by 10.194.223.41 with SMTP id qr9mr1226271wjc.61.1463488308915; Tue, 17 May 2016 05:31:48 -0700 (PDT)
Received: from [10.10.1.58] (liv3d.labs.multiplay.co.uk. [82.69.141.171]) by smtp.gmail.com with ESMTPSA id e16sm23879833wmc.3.2016.05.17.05.31.47 for (version=TLSv1/SSLv3 cipher=OTHER); Tue, 17 May 2016 05:31:47 -0700 (PDT)
Subject: Re: ZFS performance bottlenecks: CPU or RAM or anything else?
To: freebsd-fs@freebsd.org
References: <8441f4c0-f8d1-f540-b928-7ae60998ba8e@lexa.ru> <16e474da-6b20-2e51-9981-3c262eaff350@lexa.ru>
From: Steven Hartland
Message-ID: <884c4558-c207-596a-3e3e-45a6f579b666@multiplay.co.uk>
Date: Tue, 17 May 2016 13:31:53 +0100
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:45.0) Gecko/20100101 Thunderbird/45.1.0
MIME-Version: 1.0
In-Reply-To: <16e474da-6b20-2e51-9981-3c262eaff350@lexa.ru>
Content-Type: text/plain; charset=windows-1252; format=flowed
Content-Transfer-Encoding: 7bit
X-Content-Filtered-By: Mailman/MimeDel 2.1.22
X-BeenThere: freebsd-fs@freebsd.org
X-Mailman-Version: 2.1.22
Precedence: list
List-Id: Filesystems
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
X-List-Received-Date: Tue, 17 May 2016 12:31:52 -0000

There's been some recent commits which help sequential reads IIRC, so
might be worth checking on CURRENT.

On 17/05/2016 13:21, Alex Tutubalin wrote:
> On 5/17/2016 3:11 PM, InterNetX - Juergen Gotteswinter wrote:
>> Raidz is your Problem, go for Mirrors
>
> Raidz2 will survive two (any) drives failure, while mirrored stripe
> will not.
>
> So, if it is possible to increase raidz2 performance by faster CPU or
> RAM I'll go this route
>
> Alex
> _______________________________________________
> freebsd-fs@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-fs
> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org"

From owner-freebsd-fs@freebsd.org Tue May 17 12:36:09 2016
Return-Path:
Delivered-To: freebsd-fs@mailman.ysv.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 155B8B3FF6D for ; Tue, 17 May 2016 12:36:09 +0000 (UTC) (envelope-from lexa@lexa.ru)
Received: from mx3.lexa.ru (ns503534.ip-198-27-68.net [198.27.68.102]) by mx1.freebsd.org (Postfix) with ESMTP id ED1321735 for ; Tue, 17 May 2016 12:36:08 +0000 (UTC) (envelope-from lexa@lexa.ru)
Received: by mx3.lexa.ru (Postfix, from userid 66) id EFFA2224A5C; Tue, 17 May 2016 08:36:07 -0400 (EDT)
Received: from [193.124.130.166] (unknown [193.124.130.166]) by home-gw.lexa.ru (Postfix) with ESMTP id 7B5881827 for ; Tue, 17 May 2016 15:35:22 +0300 (MSK)
Subject: Re: ZFS performance bottlenecks: CPU or RAM or anything else?
To: freebsd-fs@freebsd.org References: <8441f4c0-f8d1-f540-b928-7ae60998ba8e@lexa.ru> <16e474da-6b20-2e51-9981-3c262eaff350@lexa.ru> From: Alex Tutubalin Message-ID: <1e012e43-a49b-6923-3f0a-ee77a5c8fa70@lexa.ru> Date: Tue, 17 May 2016 15:35:22 +0300 User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:45.0) Gecko/20100101 Thunderbird/45.1.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 May 2016 12:36:09 -0000 On 5/17/2016 3:29 PM, Daniel Kalchev wrote: > Not true. You can have N-way mirror and it will survive N-1 drive failures. I agree, but a 3-way mirror does not look economical compared to raidz2. > The limitations of RAIDZ performance do not come from CPU or RAM limitations, but from the underlying hardware. RAIDZ is limited to the performance of a single disk IOPS. > > CPU/RAM these days are so much faster than spinning disks or SSDs. OK. But then why did I get different results in my 2012 testing (an i3-2120 was 1.5 times faster than a Q9300 on the same HDDs)? Alex From owner-freebsd-fs@freebsd.org Tue May 17 12:41:11 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 7094CB3FFFC for ; Tue, 17 May 2016 12:41:11 +0000 (UTC) (envelope-from daniel@digsys.bg) Received: from smtp-sofia.digsys.bg (smtp-sofia.digsys.bg [193.68.21.123]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client CN "smtp-sofia.digsys.bg", Issuer "Digital Systems Operational CA" (not verified)) by mx1.freebsd.org (Postfix) with ESMTPS id 0188718D1 for ; Tue, 17 May 2016 12:41:10 +0000 (UTC) (envelope-from daniel@digsys.bg) Received: from [193.68.6.100] ([193.68.6.100]) (authenticated bits=0) by smtp-sofia.digsys.bg (8.14.9/8.14.9) with ESMTP id u4HCTKAQ053293 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO); Tue, 17 May 2016 15:29:20 +0300 (EEST) (envelope-from daniel@digsys.bg) Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\)) Subject: Re: ZFS performance bottlenecks: CPU or RAM or anything else? From: Daniel Kalchev In-Reply-To: <16e474da-6b20-2e51-9981-3c262eaff350@lexa.ru> Date: Tue, 17 May 2016 15:29:20 +0300 Cc: freebsd-fs@freebsd.org Content-Transfer-Encoding: quoted-printable Message-Id: References: <8441f4c0-f8d1-f540-b928-7ae60998ba8e@lexa.ru> <16e474da-6b20-2e51-9981-3c262eaff350@lexa.ru> To: Alex Tutubalin X-Mailer: Apple Mail (2.3124) X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 May 2016 12:41:11 -0000 > On 17.05.2016 =D0=B3., at 15:21, Alex Tutubalin wrote: >=20 > On 5/17/2016 3:11 PM, InterNetX - Juergen Gotteswinter wrote: >> Raidz is your Problem, go for Mirrors >=20 > Raidz2 will survive two (any) drives failure, while mirrored stripe = will not. >=20 Not true. You can have N-way mirror and it will survive N-1 drive = failures. > So, if it is possible to increase raidz2 performance by faster CPU or = RAM I'll go this route The limitations of RAIDZ performance do not come from CPU or RAM = limitations, but from the underlying hardware.
RAIDZ is limited to the = performance of a single disk IOPS.=20 CPU/RAM these days are so much faster than spinning disks or SSDs. Daniel= From owner-freebsd-fs@freebsd.org Tue May 17 13:24:25 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 19CBEB3D08C for ; Tue, 17 May 2016 13:24:25 +0000 (UTC) (envelope-from bfriesen@simple.dallas.tx.us) Received: from smtp.simplesystems.org (smtp.simplesystems.org [65.66.246.90]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id DFA0F2E43 for ; Tue, 17 May 2016 13:24:24 +0000 (UTC) (envelope-from bfriesen@simple.dallas.tx.us) Received: from freddy.simplesystems.org (freddy.simplesystems.org [65.66.246.65]) by smtp.simplesystems.org (8.14.4+Sun/8.14.4) with ESMTP id u4HDOMxW023751; Tue, 17 May 2016 08:24:22 -0500 (CDT) Date: Tue, 17 May 2016 08:24:22 -0500 (CDT) From: Bob Friesenhahn X-X-Sender: bfriesen@freddy.simplesystems.org To: Ben RUBSON cc: freebsd-fs@freebsd.org Subject: Re: Best practice for high availability ZFS pool In-Reply-To: Message-ID: References: <5E69742D-D2E0-437F-B4A9-A71508C370F9@FreeBSD.org> User-Agent: Alpine 2.20 (GSO 67 2015-01-07) MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII; format=flowed X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (smtp.simplesystems.org [65.66.246.90]); Tue, 17 May 2016 08:24:22 -0500 (CDT) X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 May 2016 13:24:25 -0000 On Tue, 17 May 2016, Ben RUBSON wrote: >> >> Without completely isolated systems there is always the risk of total failure. Even with zfs send there is the risk of total failure if the sent data results in corruption on the receiving side. > > In this case rollback one of the previous snapshots on the receiving side ? > Did you mean the sent data can totally brake the receiving pool making it unusable / unable to import ? Did we already see this ? There is at least one case of zfs send propagating a problem into the receiving pool. I don't know if it broke the pool. Corrupt data may be sent from one pool to another if it passes checksums. With any solution, there is the possibility of software bugs. Adding more parallel hardware decreases the chance of data loss but it increases the chance of hardware failure. 
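One cheap safeguard on the receiving side is to keep several snapshots back, and to sanity-check a stream in transit; zstreamdump(8) walks a send stream and verifies the checksums embedded in it. A rough sketch, with illustrative dataset and file names:

  zfs send -i tank/data@mon tank/data@tue > /backup/incr.zfs
  zstreamdump < /backup/incr.zfs    # prints a record summary, complains about bad checksums

That only catches damage to the stream itself, though — data that is corrupt but correctly checksummed in the source pool, as described above, passes straight through.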
Bob -- Bob Friesenhahn bfriesen@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ From owner-freebsd-fs@freebsd.org Tue May 17 14:17:05 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id BFC61B3DE7D for ; Tue, 17 May 2016 14:17:05 +0000 (UTC) (envelope-from s_sourceforge@nedprod.com) Received: from mail.nedprod.com (europe4.nedproductions.biz [213.251.186.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 5EFA82DE5 for ; Tue, 17 May 2016 14:17:04 +0000 (UTC) (envelope-from s_sourceforge@nedprod.com) Received: from authenticated-user (mail.nedprod.com [213.251.186.177]) (using TLSv1.2 with cipher ECDHE-ECDSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.nedprod.com (Postfix) with ESMTPSA id 451B614D78 for ; Tue, 17 May 2016 15:16:57 +0100 (BST) DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=nedprod.com; s=mail; t=1463494617; bh=AV2fSLMfb78I/zkiilNlcTW+c851ayn1D9nWAc/litQ=; h=Resent-from:Resent-to:Resent-date:From:To:Subject:Date:From; b=btrXNiby2SihQYmsQsSMuvDFOY9rbGHAXYUCvp7y9Kge4GSUFGUlgNEZY5d9dmjA/ mHEKPLIv6i5GEuNV1AT4ArGTM5no9vpf0q3gqXbr40YtkdIuhQiMlahjLweeLQv6lP qGpnK4UXRxvPoGpqu1Ia9U6yL0pffGEx8OlEGzqTDAfqsVqGpEBmbY/mnRdWa1QceI DF48fMstbwiA3PIbWB01lKIsWez0S+4HiHw8JaG6BtescM2HSQkCqIpN1wqT43ERpI RFEX/GniZiBb98FFIoMRi9WTK8O9yHH/nRpcaQPB82YuZQ0X3b7UCANxvLVezTvWj3 CsD6xooYs3LCw== Resent-from: "Niall Douglas" Resent-to: freebsd-fs@FreeBSD.org Resent-date: Tue, 17 May 2016 15:17:01 +0100 X-cs: R X-CS-Version: 1.0 From: Niall Douglas X-RS-ID: s_sourceforge X-RS-Flags: 0,0,1,1,0,0,0 X-RS-Header: In-reply-to: <9F057D48-5413-437B-A612-64D47E95C846@gmail.com> X-RS-Header: References: <5736E7B4.1000409@gmail.com>, <9ead4b28-9711-5e38-483f-ef9eaf0bc583@digiware.nl>, <9F057D48-5413-437B-A612-64D47E95C846@gmail.com> X-RS-Sigset: 1 To: freebsd-fs@FreeBSD.org Subject: Re: Bigger MAX_PATH (Was: Re: State of native encryption in ZFS) MIME-Version: 1.0 Content-type: text/plain; charset=UTF-8 Content-transfer-encoding: 8BIT Date: Tue, 17 May 2016 11:37:54 +0100 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 May 2016 14:17:05 -0000 On 17 May 2016 at 11:58, Ben RUBSON wrote: > >>> If FreeBSD had a bigger PATH_MAX then stackable encryptions layers > >>> like ecryptfs (encfs?) would be viable choices. Because encrypted > >>> path components are so long, one runs very rapidly into the maximum > >>> path on the system when PATH_MAX is so low. > > Could you give us some examples where PATH_MAX was too low for you using ecryptfs ? > I (for the moment) do not run into troubles using EncFS. Sure. I typed this command into my encrypted store to find all paths and sort them by length: find . 
| awk '{ print length, $0 }' | sort -n -s | cut -d" " -f2- And this was the last (longest) path returned: ./ECRYPTFS_FNEK_ENCRYPTED.FWbfg.wnsu2EnUQFbyMTM6advEpfCnjqSMVUiW5.LgoJrMb2-r t6c61qRU--/ECRYPTFS_FNEK_ENCRYPTED.FWbfg.wnsu2EnUQFbyMTM6advEpfCnjqSMVULVNQa FNaPoVWHcEh8FJ8mE--/ECRYPTFS_FNEK_ENCRYPTED.FWbfg.wnsu2EnUQFbyMTM6advEpfCnjq SMVU7LmQnMhHfh0u5yByHsE6r---/ECRYPTFS_FNEK_ENCRYPTED.FWbfg.wnsu2EnUQFbyMTM6a dvEpfCnjqSMVUyFs1x3YH5TrEDJn4uOR7qk--/ECRYPTFS_FNEK_ENCRYPTED.FXbfg.wnsu2EnU QFbyMTM6advEpfCnjqSMVU4ghdTluFURviDBNaKn5dqiV0xCDj5Ikg1JCyAoTTJN6-/ECRYPTFS_ FNEK_ENCRYPTED.FXbfg.wnsu2EnUQFbyMTM6advEpfCnjqSMVU-jycWae440W7yMwmiyP3Y2kL7 WCaoKGKU66C7Cxvk.c-/ECRYPTFS_FNEK_ENCRYPTED.FXbfg.wnsu2EnUQFbyMTM6advEpfCnjq SMVUhi1aG2eEb2eWm.A0HVk-wDsIJHSIpRFFKCNvTGLuRog-/ECRYPTFS_FNEK_ENCRYPTED.FWb fg.wnsu2EnUQFbyMTM6advEpfCnjqSMVUQSSHq93LQdjeusuoEfcYl---/ECRYPTFS_FNEK_ENCR YPTED.FWbfg.wnsu2EnUQFbyMTM6advEpfCnjqSMVUtcq-q29SemW-IOdIxu-WME--/ECRYPTFS_ FNEK_ENCRYPTED.FWbfg.wnsu2EnUQFbyMTM6advEpfCnjqSMVU8IbibKeFBd7fIPHPjbXAUU--/ ECRYPTFS_FNEK_ENCRYPTED.FWbfg.wnsu2EnUQFbyMTM6advEpfCnjqSMVUIRoDoGSCaes2geXo .1ofyE--/ECRYPTFS_FNEK_ENCRYPTED.FWbfg.wnsu2EnUQFbyMTM6advEpfCnjqSMVUhJMdWTL zJ5ZmOdxFdiH61E--/ECRYPTFS_FNEK_ENCRYPTED.FWbfg.wnsu2EnUQFbyMTM6advEpfCnjqSM VUa8r9YMTUfWQ4jQCNcJSCfE--/ECRYPTFS_FNEK_ENCRYPTED.FWbfg.wnsu2EnUQFbyMTM6adv EpfCnjqSMVUX0etb6f3dEb6LYy1ZX6uFk--/ECRYPTFS_FNEK_ENCRYPTED.FXbfg.wnsu2EnUQF byMTM6advEpfCnjqSMVUegHOnwkmJTzIaxaWQLEC-4bmxptKCfKKtzQ3vS4I4Mc- (which is 1356 characters not including the mount point of the encrypted drive) This isn't a particularly crazy encrypted drive. It contains a few backups, accounts, keys and so on. I'm not deliberately storing deep directory trees or anything. > > http://freebsd.1045724.n5.nabble.com/misc-184340-PATH-MAX-not-interope > > rable-with-Linux-td5864469.html > > And examples where PATH_MAX is too low using Rsync ? I've run into this when rsyncing Jenkins workspaces to FreeBSD (Jenkins matrix builder generates long long paths). It isn't just rsync though, it affects extracting tar archives on FreeBSD too. 
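For reference, the limit being hit can be read straight off the system; on a stock FreeBSD install this reports 1024:

  $ getconf PATH_MAX /
  1024

so a single encrypted path like the one above already exceeds it, independent of the mount point prefix.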
Niall From owner-freebsd-fs@freebsd.org Tue May 17 14:47:02 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 41F30B3E4CA for ; Tue, 17 May 2016 14:47:02 +0000 (UTC) (envelope-from ben.rubson@gmail.com) Received: from mail-wm0-x22a.google.com (mail-wm0-x22a.google.com [IPv6:2a00:1450:400c:c09::22a]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id CC7013F06 for ; Tue, 17 May 2016 14:47:01 +0000 (UTC) (envelope-from ben.rubson@gmail.com) Received: by mail-wm0-x22a.google.com with SMTP id e201so143233208wme.0 for ; Tue, 17 May 2016 07:47:01 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:subject:from:in-reply-to:date :content-transfer-encoding:message-id:references:to; bh=ZDrpdpEa/I64pn3vvDxSgzgSd0T/PlKMZX/lrkOBJZU=; b=FBAMTorGbZu3zxxQJXRsKVyZMy2tLtRQZ0qn3D/jKmShhVKhHxfdX3zaKXfRxA/bzd J9A9KOKoEUI1sk7lO8rt4vyrwNVkHK9o4Knuhmj/6FJDeghvCEXh/ZO6j2B+uhUMqM+e lbbZ/0i6p10A+Mv2MCGxwrgKw+IDeGj/dTs8butGBC70jxUb8ocoej29gma2oBidCIZI ZMUUyZvOCnF3avd4g9w7CO/KyBR8Tcj6JCWyo6ifuRmEVDu1RQQF2eVGZFpLvfK0qXbl 3LxIRZXoZoqZ8RFQXnilaGTWYiw/S6AV5L41OjQpSygRPOTNAgq/bgmx1NF94AOygLCj cnfg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:subject:from:in-reply-to:date :content-transfer-encoding:message-id:references:to; bh=ZDrpdpEa/I64pn3vvDxSgzgSd0T/PlKMZX/lrkOBJZU=; b=jz4i6YUFwAWcmSpQbDl6NfYSfMvcnSCSsmrfjbHPGuXXmQ+w4B8EbTJsEuwHd16XFu 7b0qyMMgmB67Y/FLqsu3MAODw+ccq96Jextj0ufGkkUcZ8OicuizNwnv93oQ6fWkEu4m 0NrZNEUlHMvAu9BWAJo57pq+obo+EcmAXL8rakKrrE6keOqB3MnU2URWScwBLelYcCTO Kr3On5iYZ4IO/vW5zQmTQqzUpmdTYqBwQvm0o5ee+zk6yuaaO4VtwS+r3vybLb3vRn/E XE0nZMYAzE1zV1A9ajxe7p+4pvIZs/ZCDvOC79d+GZs2Sqq5w23AvU0z08nPXlDA+i7j LbjA== X-Gm-Message-State: AOPr4FVCX47sJBvZBouZorX14muX8m4eP6njloMqHQkPq8fJDpMNFjZWP9YylhLPecf82w== X-Received: by 10.28.31.6 with SMTP id f6mr23254524wmf.69.1463496420394; Tue, 17 May 2016 07:47:00 -0700 (PDT) Received: from [192.168.1.16] (210.236.26.109.rev.sfr.net. [109.26.236.210]) by smtp.gmail.com with ESMTPSA id g3sm3485949wjb.47.2016.05.17.07.46.58 for (version=TLS1 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Tue, 17 May 2016 07:46:59 -0700 (PDT) Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\)) Subject: Re: Best practice for high availability ZFS pool From: Ben RUBSON In-Reply-To: Date: Tue, 17 May 2016 16:46:53 +0200 Content-Transfer-Encoding: quoted-printable Message-Id: <40C35566-B7FB-4F59-BB41-D43BC0362C26@gmail.com> References: <5E69742D-D2E0-437F-B4A9-A71508C370F9@FreeBSD.org> To: freebsd-fs@freebsd.org X-Mailer: Apple Mail (2.3124) X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 May 2016 14:47:02 -0000 > On 17 may 2016 at 15:24, Bob Friesenhahn = wrote: >=20 > On Tue, 17 May 2016, Ben RUBSON wrote: >>>=20 >>> Without completely isolated systems there is always the risk of = total failure. Even with zfs send there is the risk of total failure if = the sent data results in corruption on the receiving side. >>=20 >> In this case rollback one of the previous snapshots on the receiving = side ? 
>> Did you mean the sent data can totally brake the receiving pool = making it unusable / unable to import ? Did we already see this ? >=20 > There is at least one case of zfs send propagating a problem into the = receiving pool. I don't know if it broke the pool. Corrupt data may be = sent from one pool to another if it passes checksums. Do you have any link to this problem ? Would be interesting to know if = it was possible to come-back to a previous snapshot / consistent pool. I think that making ZFS send/receive has a higher security level than = mirroring to a second (or third) JBOD box. With mirroring you will still have only one ZFS pool. With send/receive, you have a second / different ZFS pool / data = "envelope", which could (I think) mitigate the "chance" of a broken / = dead pool. Mirror over 2 different JBOD boxes, and send/receive to a third one, is = I think a nice solution. However, if send/receive makes the receiving pool the exact 1:1 copy of = the sending pool, then the thing which made the sending pool to corrupt = could reach (and corrupt) the receiving pool... I don't know whether or not this could occur, and if ever it occurs, if = we have the chance to revert to a previous snapshot, at least on the = receiving side... Ben= From owner-freebsd-fs@freebsd.org Tue May 17 15:35:29 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 8B01BB3F561 for ; Tue, 17 May 2016 15:35:29 +0000 (UTC) (envelope-from ben.rubson@gmail.com) Received: from mail-wm0-x234.google.com (mail-wm0-x234.google.com [IPv6:2a00:1450:400c:c09::234]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 11B1D65FCC for ; Tue, 17 May 2016 15:35:29 +0000 (UTC) (envelope-from ben.rubson@gmail.com) Received: by mail-wm0-x234.google.com with SMTP id a17so37870716wme.0 for ; Tue, 17 May 2016 08:35:29 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:subject:from:in-reply-to:date :content-transfer-encoding:message-id:references:to; bh=B1h9c6mb6sLZC5yI1K+4gySFzHY8bBFLjTP8dgn/CKU=; b=jdSZfqJ3u9293aWr9MMTq/DYAzCVa15/OXav6etiwP8hEwwmJOuIo7Eo8e1MXN7eul 5ka7T9FUJosKfCvvaK7lqPo7bwDlomM2bkFOFniZwTL8b/EgEwIpjSmMSTstiORF3Ll6 gxwsa5GFnNzRpvOUVkTEPG1L11BvDNSbBz8/FkDGP7QIrlv9f97kdRaEqkn9Uicnf3+a kIEnaqdS9InTmcuAyFSCVGGriXFhRPfZt8oAAWc08BKb5bixTKBcYdf+SM+tIGyd5uHq 6iUdl767JGvHAf+jMSQQkcnWjCbYEJX3QriMTHpI5OTA0ZO45b+pyQgrxt8dIg4z7DIz /Uwg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:subject:from:in-reply-to:date :content-transfer-encoding:message-id:references:to; bh=B1h9c6mb6sLZC5yI1K+4gySFzHY8bBFLjTP8dgn/CKU=; b=NKyHbcehTOKmC6EXrBY2RBiALh1FcPyax/uG22NWoUdMmsAZAsQnndMpXjG0WuF00f 8cX4Ze/yxHcb+0q+YMgXXJfILYzjCQf29rJ6KavwVmvu7niYCOuL5U7hYslEKS515AL6 YzKQg0otZfvgBZqXC07V76JkOQiG82Wmp3KxURbNwmLz1XNXNly3arzcDCkNF/YXy/GJ i3Jsr1teip5fwvzqo6gf79JDQmsLieeJGAaCImTBC8Qysh11Mbe48+D96YdL3X2ThGW1 CKFmVVeaKP+4ZRSs/kkSP0MHUa1ZRbmGXfR440+Bng9Q5aqjB+Um8fVeFlr4GpvJd63v x62Q== X-Gm-Message-State: AOPr4FVE4Oo0en/VfKoV0sJnQuX7qTrFwwWScZMexeVq/H28giphk8RpBrVLjqKrySCNOQ== X-Received: by 10.28.39.196 with SMTP id n187mr2204377wmn.4.1463499327370; Tue, 17 May 2016 08:35:27 -0700 (PDT) Received: from [192.168.1.16] 
(210.236.26.109.rev.sfr.net. [109.26.236.210]) by smtp.gmail.com with ESMTPSA id kz1sm3705899wjc.46.2016.05.17.08.35.26 for (version=TLS1 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Tue, 17 May 2016 08:35:26 -0700 (PDT) Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\)) Subject: Re: Bigger MAX_PATH (Was: Re: State of native encryption in ZFS) From: Ben RUBSON In-Reply-To: <573b27e8.0604620a.3a15c.ffffe914SMTPIN_ADDED_MISSING@mx.google.com> Date: Tue, 17 May 2016 17:35:25 +0200 Content-Transfer-Encoding: quoted-printable Message-Id: <546E5477-E636-49D4-A137-16FDA2CA1E7B@gmail.com> References: <573b27e8.0604620a.3a15c.ffffe914SMTPIN_ADDED_MISSING@mx.google.com> To: "freebsd-fs@FreeBSD.org" X-Mailer: Apple Mail (2.3124) X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 May 2016 15:35:29 -0000 > On 17 may 2016 at 12:37, Niall Douglas wrote: >=20 > On 17 May 2016 at 11:58, Ben RUBSON wrote: >=20 >>>>> If FreeBSD had a bigger PATH_MAX then stackable encryptions layers >>>>> like ecryptfs (encfs?) would be viable choices. Because encrypted >>>>> path components are so long, one runs very rapidly into the = maximum >>>>> path on the system when PATH_MAX is so low. >>=20 >> Could you give us some examples where PATH_MAX was too low for you = using ecryptfs ? >> I (for the moment) do not run into troubles using EncFS. >=20 > Sure. >=20 > I typed this command into my encrypted store to find all paths and = sort=20 > them by length: >=20 > find . | awk '{ print length, $0 }' | sort -n -s | cut -d" " -f2- >=20 > And this was the last (longest) path returned: >=20 > (...) >=20 > (which is 1356 characters not including the mount point of the = encrypted=20 > drive) >=20 > This isn't a particularly crazy encrypted drive. It contains a few = backups,=20 > accounts, keys and so on. I'm not deliberately storing deep directory = trees=20 > or anything. >=20 >>> = http://freebsd.1045724.n5.nabble.com/misc-184340-PATH-MAX-not-interope >>> rable-with-Linux-td5864469.html >>=20 >> And examples where PATH_MAX is too low using Rsync ? >=20 > I've run into this when rsyncing Jenkins workspaces to FreeBSD = (Jenkins=20 > matrix builder generates long long paths). It isn't just rsync though, = it=20 > affects extracting tar archives on FreeBSD too. Thank you Niall for your answer. I managed to reproduce the issue creating a 900 characters path (9 = subfolders of 100 characters each) and Rsyncing it to an EncFS remote = folder. No problem Rsyncing to EncFS on Linux, but it fails on FreeBSD, it only = created 4 of the 9 subfolders. So yes PATH_MAX could really be a limitation. Why did not you make the choice to rebuild the kernel using a higher = PATH_MAX (instead of using ZOL) ? 
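For anyone wanting to try that, the constant itself lives in one header — the rebuild is the expensive part. An untested sketch (4096 is an arbitrary choice; anything compiled against the old limit would need rebuilding, which is the ABI concern raised in the PR):

  # In /usr/src/sys/sys/syslimits.h, change
  #   #define PATH_MAX  1024  /* max bytes in pathname */
  # to e.g.
  #   #define PATH_MAX  4096
  # (MAXPATHLEN in sys/param.h is defined as PATH_MAX, so it follows along.)
  cd /usr/src && make buildworld buildkernel
  make installkernel
  # reboot into the new kernel, then:
  make installworld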
Ben From owner-freebsd-fs@freebsd.org Tue May 17 15:48:18 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id F0BE3B3F825 for ; Tue, 17 May 2016 15:48:18 +0000 (UTC) (envelope-from ben.rubson@gmail.com) Received: from mail-wm0-x22e.google.com (mail-wm0-x22e.google.com [IPv6:2a00:1450:400c:c09::22e]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 873F86795C for ; Tue, 17 May 2016 15:48:18 +0000 (UTC) (envelope-from ben.rubson@gmail.com) Received: by mail-wm0-x22e.google.com with SMTP id g17so38443557wme.1 for ; Tue, 17 May 2016 08:48:18 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:subject:from:in-reply-to:date :content-transfer-encoding:message-id:references:to; bh=C+LWWZlWj7EdsooAVmplL5kc3DR7eCEuWRKTw7DK1yM=; b=dw0FcIRHHTrYdggM9JNIs1fc0ujz+zsNwpJMktbFh7kqoLZipXYpAfDm041RBGORE5 0JobYYDMJro90oQhOlsHSxM+w1MjQVemOXRGt1uyEh1PfaI2RilJtsJ4On86aaOIqV3r nfv70tu4CzfHtRFsfG9fiSLSZ+uR1+f6DrkNGIkdwL7KVYf//+o/S3Od+NspKi7Ffnu7 1MAfS2DOmCtUx6ORik37uV7ZpH5UUwRGV8mkKwGx8cX+S57F0x7bes9MgMO5/GfqgovQ MzTJYuIYoYtLFDqyn+mU+C4EQoujW23awKdhShsMOsaRqI8Be9pAiJSIX5o4FAIr0mrd qMtw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:subject:from:in-reply-to:date :content-transfer-encoding:message-id:references:to; bh=C+LWWZlWj7EdsooAVmplL5kc3DR7eCEuWRKTw7DK1yM=; b=RraUwOJQNNiXfkeqlyhcCxBUo1PrkhTA0w1mGxuWmvwb943DQJ8KciNWPgKMNt/y2G L17/3J9e7iTu6qgTdo9arIAYbvx8+HF310yUPu95v3TqYAu+gBntKCXTpb04heVm2N7i +5dhEyy2Qjo4wszn5/cT5Rfn0qNcuj0JflMi9sVWGlIH94FPe41JD7JRNg7ZsiGAgRcD JIPE3G2GxacQUebjqn2g3TLlPpq9MihPTg21tXNV6IsrRY4bQc2GTkFPwFkeGww6m5c+ FmBwnuAfDFB2dPBaBzpp5psmaVjgPZTOsUiz1+rq0vfnrWX4+m5yfyX9NxOxA5i3HOlt QUZQ== X-Gm-Message-State: AOPr4FWsv7z/wE3k1dZthdajQ5G9IiCOFTB7yfmbT87iVJ8UzmtNnxFhsOwW5BmIwqY+vQ== X-Received: by 10.194.166.3 with SMTP id zc3mr2300110wjb.104.1463500097109; Tue, 17 May 2016 08:48:17 -0700 (PDT) Received: from [192.168.1.16] (210.236.26.109.rev.sfr.net. 
[109.26.236.210]) by smtp.gmail.com with ESMTPSA id m140sm24636025wma.24.2016.05.17.08.48.16 for (version=TLS1 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Tue, 17 May 2016 08:48:16 -0700 (PDT) Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\)) Subject: Re: Bigger MAX_PATH (Was: Re: State of native encryption in ZFS) From: Ben RUBSON In-Reply-To: <546E5477-E636-49D4-A137-16FDA2CA1E7B@gmail.com> Date: Tue, 17 May 2016 17:48:15 +0200 Content-Transfer-Encoding: 7bit Message-Id: References: <573b27e8.0604620a.3a15c.ffffe914SMTPIN_ADDED_MISSING@mx.google.com> <546E5477-E636-49D4-A137-16FDA2CA1E7B@gmail.com> To: "freebsd-fs@FreeBSD.org" X-Mailer: Apple Mail (2.3124) X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 May 2016 15:48:19 -0000 For reference, bug report is here : https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=184340 From owner-freebsd-fs@freebsd.org Tue May 17 16:13:26 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 718DBB3FE99 for ; Tue, 17 May 2016 16:13:26 +0000 (UTC) (envelope-from joe@getsomewhere.net) Received: from prak.gameowls.com (prak.gameowls.com [IPv6:2001:19f0:5c00:950b:5400:ff:fe14:46b7]) by mx1.freebsd.org (Postfix) with ESMTP id 4CEF13FB1; Tue, 17 May 2016 16:13:26 +0000 (UTC) (envelope-from joe@getsomewhere.net) Received: from [IPv6:2001:470:c412:beef:135:c8df:2d0e:4ea6] (unknown [IPv6:2001:470:c412:beef:135:c8df:2d0e:4ea6]) (using TLSv1 with cipher ECDHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by prak.gameowls.com (Postfix) with ESMTPSA id 1BC5118C3D; Tue, 17 May 2016 11:13:18 -0500 (CDT) Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\)) Subject: Re: Best practice for high availability ZFS pool From: Joe Love In-Reply-To: <5E69742D-D2E0-437F-B4A9-A71508C370F9@FreeBSD.org> Date: Tue, 17 May 2016 11:13:18 -0500 Cc: freebsd-fs@freebsd.org Content-Transfer-Encoding: quoted-printable Message-Id: <5DA13472-F575-4D3D-80B7-1BE371237CE5@getsomewhere.net> References: <5E69742D-D2E0-437F-B4A9-A71508C370F9@FreeBSD.org> To: Palle Girgensohn X-Mailer: Apple Mail (2.3124) X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 May 2016 16:13:26 -0000 > On May 16, 2016, at 5:08 AM, Palle Girgensohn = wrote: >=20 > Hi, >=20 > We need to set up a ZFS pool with redundance. The main goal is high = availability - uptime. >=20 > I can see a few of paths to follow. >=20 > 1. HAST + ZFS >=20 > 2. Some sort of shared storage, two machines sharing a JBOD box. >=20 > 3. ZFS replication (zfs snapshot + zfs send | ssh | zfs receive) >=20 > 4. using something else than ZFS, even a different OS if required. >=20 > My main concern with HAST+ZFS is performance. Google offer some = insights here, I find mainly unsolved problems. Please share any success = stories or other experiences. >=20 > Shared storage still has a single point of failure, the JBOD box. = Apart from that, is there even any support for the kind of storage PCI = cards that support dual head for a storage box? I cannot find any. 
>=20 > We are running with ZFS replication today, but it is just too slow for = the amount of data. >=20 > We prefer to keep ZFS as we already have a rather big (~30 TB) pool = and also tools, scripts, backup all is using ZFS, but if there is no = solution using ZFS, we're open to alternatives. Nexenta springs to mind, = but I believe it is using shared storage for redundance, so it does have = single points of failure? >=20 > Any other suggestions? Please share your experience. :) >=20 > Palle >=20 I don=E2=80=99t know if this falls into the realm of what you want, but = BSDMag just released an issue with an article entitled =E2=80=9CAdding = ZFS to the FreeBSD dual-controller storage concept.=E2=80=9D https://bsdmag.org/download/reusing_openbsd/ My understanding in this setup is that the only single point of failure = for this model is the backplanes that the drives would connect to. = Depending on your controller cards, this could be alleviated by simply = using multiple drive shelves, and only using one drive/shelf as part of = a vdev (then stripe or whatnot over your vdevs). It might not be what you=E2=80=99re after, as it=E2=80=99s basically two = systems with their own controllers, with a shared set of drives. Some = expansion from the virtual world to real physical systems will probably = need additional variations. I think the TrueNAS system (with HA) is setup similar to this, only = without the split between the drives being primarily handled by separate = controllers, but someone with more in-depth knowledge would need to = confirm/deny this. -Joe From owner-freebsd-fs@freebsd.org Tue May 17 16:20:04 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 93633B3F04A for ; Tue, 17 May 2016 16:20:04 +0000 (UTC) (envelope-from girgen@FreeBSD.org) Received: from mail.pingpong.net (mail.pingpong.net [79.136.116.202]) by mx1.freebsd.org (Postfix) with ESMTP id 257DC6454D for ; Tue, 17 May 2016 16:20:03 +0000 (UTC) (envelope-from girgen@FreeBSD.org) Received: from [172.16.0.5] (citron.pingpong.net [195.178.173.66]) (using TLSv1 with cipher ECDHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.pingpong.net (Postfix) with ESMTPSA id 4821716BE8; Tue, 17 May 2016 18:19:55 +0200 (CEST) Subject: Re: Best practice for high availability ZFS pool Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\)) Content-Type: multipart/signed; boundary="Apple-Mail=_7087E43E-6579-48E7-BFA5-610E1B270D42"; protocol="application/pgp-signature"; micalg=pgp-sha256 X-Pgp-Agent: GPGMail 2.6b2 From: Palle Girgensohn In-Reply-To: <5DA13472-F575-4D3D-80B7-1BE371237CE5@getsomewhere.net> Date: Tue, 17 May 2016 18:19:54 +0200 Cc: freebsd-fs@freebsd.org, Julian Akehurst Message-Id: <7D4449E9-5875-45EB-8559-3B43F2E5E3B0@FreeBSD.org> References: <5E69742D-D2E0-437F-B4A9-A71508C370F9@FreeBSD.org> <5DA13472-F575-4D3D-80B7-1BE371237CE5@getsomewhere.net> To: Joe Love X-Mailer: Apple Mail (2.3124) X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 May 2016 16:20:04 -0000 --Apple-Mail=_7087E43E-6579-48E7-BFA5-610E1B270D42 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=utf-8 > 17 maj 2016 kl. 
18:13 skrev Joe Love : >=20 >=20 >> On May 16, 2016, at 5:08 AM, Palle Girgensohn = wrote: >>=20 >> Hi, >>=20 >> We need to set up a ZFS pool with redundance. The main goal is high = availability - uptime. >>=20 >> I can see a few of paths to follow. >>=20 >> 1. HAST + ZFS >>=20 >> 2. Some sort of shared storage, two machines sharing a JBOD box. >>=20 >> 3. ZFS replication (zfs snapshot + zfs send | ssh | zfs receive) >>=20 >> 4. using something else than ZFS, even a different OS if required. >>=20 >> My main concern with HAST+ZFS is performance. Google offer some = insights here, I find mainly unsolved problems. Please share any success = stories or other experiences. >>=20 >> Shared storage still has a single point of failure, the JBOD box. = Apart from that, is there even any support for the kind of storage PCI = cards that support dual head for a storage box? I cannot find any. >>=20 >> We are running with ZFS replication today, but it is just too slow = for the amount of data. >>=20 >> We prefer to keep ZFS as we already have a rather big (~30 TB) pool = and also tools, scripts, backup all is using ZFS, but if there is no = solution using ZFS, we're open to alternatives. Nexenta springs to mind, = but I believe it is using shared storage for redundance, so it does have = single points of failure? >>=20 >> Any other suggestions? Please share your experience. :) >>=20 >> Palle >>=20 >=20 > I don=E2=80=99t know if this falls into the realm of what you want, = but BSDMag just released an issue with an article entitled =E2=80=9CAdding= ZFS to the FreeBSD dual-controller storage concept.=E2=80=9D > https://bsdmag.org/download/reusing_openbsd/ >=20 > My understanding in this setup is that the only single point of = failure for this model is the backplanes that the drives would connect = to. Depending on your controller cards, this could be alleviated by = simply using multiple drive shelves, and only using one drive/shelf as = part of a vdev (then stripe or whatnot over your vdevs). >=20 > It might not be what you=E2=80=99re after, as it=E2=80=99s basically = two systems with their own controllers, with a shared set of drives. = Some expansion from the virtual world to real physical systems will = probably need additional variations. > I think the TrueNAS system (with HA) is setup similar to this, only = without the split between the drives being primarily handled by separate = controllers, but someone with more in-depth knowledge would need to = confirm/deny this. >=20 > -Joe >=20 This is actually very interesting IMO. It is simple and easy to understand. Problem is I didn't find any proper = controller cards for it. I think this is what Nexenta does as well as = TrueNAS, with their HA versions. I'll check out the article, thanks! 
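On the controller side, FreeBSD's gmultipath(8) can at least collapse two HBA paths to the same disk into a single device — a sketch with illustrative device names (this handles path/controller failover inside one host; it does not arbitrate two hosts importing the same pool):

  # each disk shows up once per HBA; glue the two paths together
  gmultipath label -v disk01 /dev/da0 /dev/da8
  gmultipath label -v disk02 /dev/da1 /dev/da9
  zpool create tank mirror multipath/disk01 multipath/disk02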
Palle --Apple-Mail=_7087E43E-6579-48E7-BFA5-610E1B270D42 Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename=signature.asc Content-Type: application/pgp-signature; name=signature.asc Content-Description: Message signed with OpenPGP using GPGMail -----BEGIN PGP SIGNATURE----- Comment: GPGTools - http://gpgtools.org iQEcBAEBCAAGBQJXO0SrAAoJEDQn0sf36UlsInYIAMPaUZ8Fw5YYy0Zqk3/1JpL0 q4KLG8+iCMuagZWJyarF5EdmEJAw+hEuWRbG8uAH1gr7XS8BEN58QutxI6zKKdVm LSKgpXCxlOQdR3M/fJuE09t+YWepcs+MmAbR8ns5YoceURZU1rXNdjTwGdhqPvk3 PfFVhPX6CiFG3YlqsGcfAKfqVBbhkzmh5bvg7rHGH+TIZDx3qTsOhnW97j86Rr5V rV2Egf6vEOCuJN8GvzQAmE4E7X2+o+kS2EugUtbWCAurmK0/kM3qTC2+7BpQW2vn dgUmXX6wElNSTIOyBzksLMlq7L4fxi0Gdv2p1EOWP1LU9AKDInNMDrfpkogN/Ls= =4d2V -----END PGP SIGNATURE----- --Apple-Mail=_7087E43E-6579-48E7-BFA5-610E1B270D42-- From owner-freebsd-fs@freebsd.org Tue May 17 17:06:44 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id E98EBB3F35A for ; Tue, 17 May 2016 17:06:44 +0000 (UTC) (envelope-from bfriesen@simple.dallas.tx.us) Received: from smtp.simplesystems.org (smtp.simplesystems.org [65.66.246.90]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id B88661105 for ; Tue, 17 May 2016 17:06:44 +0000 (UTC) (envelope-from bfriesen@simple.dallas.tx.us) Received: from freddy.simplesystems.org (freddy.simplesystems.org [65.66.246.65]) by smtp.simplesystems.org (8.14.4+Sun/8.14.4) with ESMTP id u4HH6fv9028372; Tue, 17 May 2016 12:06:42 -0500 (CDT) Date: Tue, 17 May 2016 12:06:41 -0500 (CDT) From: Bob Friesenhahn X-X-Sender: bfriesen@freddy.simplesystems.org To: Ben RUBSON cc: freebsd-fs@freebsd.org Subject: Re: Best practice for high availability ZFS pool In-Reply-To: <40C35566-B7FB-4F59-BB41-D43BC0362C26@gmail.com> Message-ID: References: <5E69742D-D2E0-437F-B4A9-A71508C370F9@FreeBSD.org> <40C35566-B7FB-4F59-BB41-D43BC0362C26@gmail.com> User-Agent: Alpine 2.20 (GSO 67 2015-01-07) MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII; format=flowed X-Greylist: Sender IP whitelisted, not delayed by milter-greylist-4.5.16 (smtp.simplesystems.org [65.66.246.90]); Tue, 17 May 2016 12:06:42 -0500 (CDT) X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 May 2016 17:06:45 -0000 On Tue, 17 May 2016, Ben RUBSON wrote: >> On 17 may 2016 at 15:24, Bob Friesenhahn wrote: >> >> There is at least one case of zfs send propagating a problem into the receiving pool. I don't know if it broke the pool. Corrupt data may be sent from one pool to another if it passes checksums. > > Do you have any link to this problem ? Would be interesting to know if it was possible to come-back to a previous snapshot / consistent pool. I don't have a link but I recall that it had something to do with the ability to send file 'holes' in the stream. > I think that making ZFS send/receive has a higher security level than mirroring to a second (or third) JBOD box. > With mirroring you will still have only one ZFS pool. This is a reasonable assumption. > However, if send/receive makes the receiving pool the exact 1:1 copy > of the sending pool, then the thing which made the sending pool to > corrupt could reach (and corrupt) the receiving pool... 
I don't know > whether or not this could occur, and if ever it occurs, if we have > the chance to revert to a previous snapshot, at least on the > receiving side... Zfs receive does not result in a 1:1 copy. The underlying data organization can be completely different and compression or other options can be changed. Bob -- Bob Friesenhahn bfriesen@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ From owner-freebsd-fs@freebsd.org Tue May 17 18:00:45 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id D0516B3F2BB for ; Tue, 17 May 2016 18:00:45 +0000 (UTC) (envelope-from lkateley@kateley.com) Received: from mail-io0-x22f.google.com (mail-io0-x22f.google.com [IPv6:2607:f8b0:4001:c06::22f]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id A5FFB1C1F for ; Tue, 17 May 2016 18:00:45 +0000 (UTC) (envelope-from lkateley@kateley.com) Received: by mail-io0-x22f.google.com with SMTP id i75so33759545ioa.3 for ; Tue, 17 May 2016 11:00:45 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kateley-com.20150623.gappssmtp.com; s=20150623; h=reply-to:subject:references:to:from:organization:message-id:date :user-agent:mime-version:in-reply-to:content-transfer-encoding; bh=wP2UykMcSjQT9UMzYWI3gy6Q2O3oGjERkEmxjNa9Mk8=; b=jvjoRUiURtJav+iXlTT9V8az+nURg55tF/X9oqY5+rJIRdvIUrwdBFY4c2upNluZFN TZTtIgvhXMywMoU5au4dpBLbBOFX3ucjOyg5Vh75aT4eegh8iwFCYHkGF/IG5K83vZtz dO2piTvjdvU76wixOG0qn/xe8yslZw6FXnJyVi4WF+xFXoHE41iVinGZ1oblx1/f03ag xX1vKrSwYYEYndQUvPkv6G9SYZe8vxW0dsuktWT6xVObwZOtyiamHbW8+HBi7LgZvWzc kRhCBA2dYWc/JbOTa/yZi+mcgNhF9urNcWhxsyutrZt9vNnXaUcV+kihDn88Qbd7t9NL IuOg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:reply-to:subject:references:to:from:organization :message-id:date:user-agent:mime-version:in-reply-to :content-transfer-encoding; bh=wP2UykMcSjQT9UMzYWI3gy6Q2O3oGjERkEmxjNa9Mk8=; b=ZgfdViujFJQcwxR+F3Jql8exIgAO+asYCggEjax1AqerwGM0l0LmgmZdQZ4zoiom7o ppWypqYnh4o4wi/JhOqmT0t/5gp2xC/ZmqmM6wIu9dV2JKBobg6mxC4JjQBJ3zMmgxVp NFMbEi+VeExbs1Up7hv0D+Nnl3NAJKHNAxfu34ILRyNS3gvM6t2X/yYLkAGyNVfx/ShA YcUDDTfO/7Wwl0mMqD07+DWcbMDynABU5p56Ph30d4dL4ZhtNsgUkbv7dU9izRuR6OAp MGSDGGxAD9KYv1AV2yBOSIpojPG1AYTmNRlduBRCI0mkMqNedIYU4IG9zipHUxGlqMvv 3WIA== X-Gm-Message-State: AOPr4FWoZ7l+qZCwWdFvDrXb+/BOKz5Esq8I7ddJjDhkAZpTJwkyDyIRLa0nKB6CGLHmxA== X-Received: by 10.36.83.20 with SMTP id n20mr2175221itb.61.1463508045094; Tue, 17 May 2016 11:00:45 -0700 (PDT) Received: from [192.168.0.4] ([63.231.252.189]) by smtp.googlemail.com with ESMTPSA id j188sm1399272ita.8.2016.05.17.11.00.43 for (version=TLSv1/SSLv3 cipher=OTHER); Tue, 17 May 2016 11:00:44 -0700 (PDT) Reply-To: linda@kateley.com Subject: Re: Best practice for high availability ZFS pool References: <5E69742D-D2E0-437F-B4A9-A71508C370F9@FreeBSD.org> <5DA13472-F575-4D3D-80B7-1BE371237CE5@getsomewhere.net> To: freebsd-fs@freebsd.org From: Linda Kateley Organization: Kateley Company Message-ID: <573B5C4B.80406@kateley.com> Date: Tue, 17 May 2016 13:00:43 -0500 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:38.0) Gecko/20100101 Thunderbird/38.7.2 MIME-Version: 1.0 In-Reply-To: <5DA13472-F575-4D3D-80B7-1BE371237CE5@getsomewhere.net> 
Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 8bit X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 May 2016 18:00:45 -0000 On 5/17/16 11:13 AM, Joe Love wrote: >> On May 16, 2016, at 5:08 AM, Palle Girgensohn wrote: >> >> Hi, >> >> We need to set up a ZFS pool with redundance. The main goal is high availability - uptime. >> >> I can see a few of paths to follow. >> >> 1. HAST + ZFS >> >> 2. Some sort of shared storage, two machines sharing a JBOD box. >> >> 3. ZFS replication (zfs snapshot + zfs send | ssh | zfs receive) >> >> 4. using something else than ZFS, even a different OS if required. >> >> My main concern with HAST+ZFS is performance. Google offer some insights here, I find mainly unsolved problems. Please share any success stories or other experiences. >> >> Shared storage still has a single point of failure, the JBOD box. Apart from that, is there even any support for the kind of storage PCI cards that support dual head for a storage box? I cannot find any. >> >> We are running with ZFS replication today, but it is just too slow for the amount of data. >> >> We prefer to keep ZFS as we already have a rather big (~30 TB) pool and also tools, scripts, backup all is using ZFS, but if there is no solution using ZFS, we're open to alternatives. Nexenta springs to mind, but I believe it is using shared storage for redundance, so it does have single points of failure? >> >> Any other suggestions? Please share your experience. :) For true high availability there is an application RSF-1 that can get full HA. I am not sure the exact failover times, but the last time I talked to them, it was very low. They also run higher up in ZFS. >> >> Palle >> > I don’t know if this falls into the realm of what you want, but BSDMag just released an issue with an article entitled “Adding ZFS to the FreeBSD dual-controller storage concept.” > https://bsdmag.org/download/reusing_openbsd/ > > My understanding in this setup is that the only single point of failure for this model is the backplanes that the drives would connect to. Most of the jbods you can buy also have the ability to have dual backplanes also > Depending on your controller cards, this could be alleviated by simply using multiple drive shelves, and only using one drive/shelf as part of a vdev (then stripe or whatnot over your vdevs). > > It might not be what you’re after, as it’s basically two systems with their own controllers, with a shared set of drives. Some expansion from the virtual world to real physical systems will probably need additional variations. > I think the TrueNAS system (with HA) is setup similar to this, only without the split between the drives being primarily handled by separate controllers, but someone with more in-depth knowledge would need to confirm/deny this. 
> > -Joe > > _______________________________________________ > freebsd-fs@freebsd.org mailing list > https://lists.freebsd.org/mailman/listinfo/freebsd-fs > To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org" From owner-freebsd-fs@freebsd.org Tue May 17 19:04:27 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id C5A8BB3E25C for ; Tue, 17 May 2016 19:04:27 +0000 (UTC) (envelope-from brandon.wandersee@gmail.com) Received: from mail-ig0-f171.google.com (mail-ig0-f171.google.com [209.85.213.171]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 9D71918CF for ; Tue, 17 May 2016 19:04:27 +0000 (UTC) (envelope-from brandon.wandersee@gmail.com) Received: by mail-ig0-f171.google.com with SMTP id bi2so80216550igb.0 for ; Tue, 17 May 2016 12:04:27 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:references:user-agent:from:to:cc:subject :in-reply-to:date:message-id:mime-version; bh=qXUBYXlS1mLmujjk3kms9ENcUdPNe0C8oGEQ3MbMq98=; b=Zg9rue/4lvyDuFeNQ0FGZ/ixCw6ridrbh/suV4sMFRo8kVoEZ/HLTvyhfURAIIZpCq kbwz1DerbeXM1dQrgrMqLKwmnt59wUe1/UZg8dAFHLO4G+OHs0D+sv9k6Cz48irLmXs/ WcDeTyQr/vNEAxaPlD0/HNCjjDV9Y8WBAbMq3HT7VbQGoVGHThAnFZs3yIHOrT6vretO h+hqogtAI3tiCgf6kGICcui6oR5nnwNk/euQA4FqoDvMjVlulzNRXe4zLNaUAxcERBU2 +IC/a7gOguN1B0C1FqRTBskUNEzvpX2bR1NwF14Z05G5DQJhXUahz4sAurqZr5H11JT7 9pgg== X-Gm-Message-State: AOPr4FUYo2MOQpD7EUwUYvYARaODmRIs1hD35K4MNJDXZxh3abFttmTuZ622nk4Lj8Yc3g== X-Received: by 10.50.140.193 with SMTP id ri1mr15275060igb.60.1463511861711; Tue, 17 May 2016 12:04:21 -0700 (PDT) Received: from WorkBox.Home.gmail.com (97-116-8-66.mpls.qwest.net. [97.116.8.66]) by smtp.gmail.com with ESMTPSA id g186sm1478399iof.27.2016.05.17.12.04.19 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Tue, 17 May 2016 12:04:20 -0700 (PDT) References: <8441f4c0-f8d1-f540-b928-7ae60998ba8e@lexa.ru> <16e474da-6b20-2e51-9981-3c262eaff350@lexa.ru> <1e012e43-a49b-6923-3f0a-ee77a5c8fa70@lexa.ru> User-agent: mu4e 0.9.16; emacs 24.5.1 From: Brandon J. Wandersee To: Alex Tutubalin Cc: freebsd-fs@freebsd.org Subject: Re: ZFS performance bottlenecks: CPU or RAM or anything else? In-reply-to: <1e012e43-a49b-6923-3f0a-ee77a5c8fa70@lexa.ru> Date: Tue, 17 May 2016 14:04:18 -0500 Message-ID: <86shxgsdzh.fsf@WorkBox.Home> MIME-Version: 1.0 Content-Type: text/plain X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 May 2016 19:04:27 -0000 Alex Tutubalin writes: > On 5/17/2016 3:29 PM, Daniel Kalchev wrote: > >> Not true. You can have N-way mirror and it will survive N-1 drive failures. > I agree, but 3-way mirror does not looks economical compared to raidz2. If you're already planning for multiple simultaneous drive failures, "economical" isn't really a factor, is it? Those disks have to get replaced regardless of the redundancy scheme you assign to them. ;) Whether the concern is performance or capacity, mirrors will offer the most flexibility. 
Increasing either the performance or capacity of a RAIDZ pool necessitates either replacing every disk in the pool or doubling the number of disks in the pool, all at once. Mirrors allow you to grow a pool and increase/decrease redundancy asymmetrically. True, four disks in a two-mirror stripe will see you restoring a backup if one disk from each mirror dies, but (arguably) six disks in a two-mirror stripe offer both better redundancy and better performance. Speaking strictly about performance, RAIDZ performance is pretty much fixed, while mirrored performance will (I believe) increase slightly as you add disks and increase greatly as you add vdevs. -- :: Brandon J. Wandersee :: brandon.wandersee@gmail.com :: -------------------------------------------------- :: 'The best design is as little design as possible.' :: --- Dieter Rams ---------------------------------- From owner-freebsd-fs@freebsd.org Tue May 17 19:11:06 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 6E7C3B3E397 for ; Tue, 17 May 2016 19:11:06 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mailman.ysv.freebsd.org (mailman.ysv.freebsd.org [IPv6:2001:1900:2254:206a::50:5]) by mx1.freebsd.org (Postfix) with ESMTP id 5B6101B79 for ; Tue, 17 May 2016 19:11:06 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: by mailman.ysv.freebsd.org (Postfix) id 571DDB3E396; Tue, 17 May 2016 19:11:06 +0000 (UTC) Delivered-To: fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 54778B3E394 for ; Tue, 17 May 2016 19:11:06 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail110.syd.optusnet.com.au (mail110.syd.optusnet.com.au [211.29.132.97]) by mx1.freebsd.org (Postfix) with ESMTP id EF8D71B76 for ; Tue, 17 May 2016 19:11:05 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from c122-106-149-109.carlnfd1.nsw.optusnet.com.au (c122-106-149-109.carlnfd1.nsw.optusnet.com.au [122.106.149.109]) by mail110.syd.optusnet.com.au (Postfix) with ESMTPS id 9F912780BD4; Wed, 18 May 2016 05:11:01 +1000 (AEST) Date: Wed, 18 May 2016 05:11:01 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Konstantin Belousov cc: fs@freebsd.org Subject: Re: quick fix for slow directory shrinking in ffs In-Reply-To: <20160517111715.GC89104@kib.kiev.ua> Message-ID: <20160518035413.L4357@besplex.bde.org> References: <20160517072705.F2157@besplex.bde.org> <20160517082050.GX89104@kib.kiev.ua> <20160517192933.U4573@besplex.bde.org> <20160517111715.GC89104@kib.kiev.ua> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.1 cv=c+ZWOkJl c=1 sm=1 tr=0 a=R/f3m204ZbWUO/0rwPSMPw==:117 a=L9H7d07YOLsA:10 a=9cW_t1CCXrUA:10 a=s5jvgZ67dGcA:10 a=kj9zAlcOel0A:10 a=m90GG2ySlDWwqfHBogYA:9 a=5Ij-lXQwDHgBn59Q:21 a=qLoGEIkQGsovWur8:21 a=CjuIK1q_8ugA:10 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 May 2016 19:11:06 -0000 On Tue, 17 May 2016, Konstantin Belousov wrote: > On Tue, May 17, 2016 at 08:26:26PM +1000, Bruce Evans wrote: >> On Tue, 17 May 2016, Konstantin Belousov wrote: >> >>> On Tue, May 17, 2016 at 07:54:27AM +1000, Bruce Evans wrote: >>>> ffs does 
very slow shrinking of directories after removing some files >>>> leaves unused blocks at the end, by always doing synchronous truncation. >>>> ... >>>> X Index: ufs_lookup.c >>>> X =================================================================== >>>> X --- ufs_lookup.c (revision 299263) >>>> X +++ ufs_lookup.c (working copy) >>>> X @@ -1131,9 +1131,9 @@ >>>> X if (tvp != NULL) >>>> X VOP_UNLOCK(tvp, 0); >>>> X error = UFS_TRUNCATE(dvp, (off_t)dp->i_endoff, >>>> X - IO_NORMAL | IO_SYNC, cr); >>>> X + IO_NORMAL | (DOINGASYNC(dvp) ? 0 : IO_SYNC), cr); >>>> X if (error != 0) >>>> X - vprint("ufs_direnter: failted to truncate", dvp); >>>> X + vprint("ufs_direnter: failed to truncate", dvp); I keep looking at wrong versions. I checked the old version and now see another problem with this "failted" message (which you fixed). It is debugging code and shouldn't be printed at all. Old versions ignored errors from the truncation since the truncation is supposed to be optional but that was broken for dirhash so r262812 added error handling. If the error handling actually works, then this becomes a non-error. >> Some relevant code in ffs_truncate: This was from an old versions. Perhaps r181717. FreeBSD-8 is similar, but FreeBSD-9+ has most of my DOINGASYNC() additions and -current has just 2 more of them than FreeBSD-9. >> Y /* >> Y * Shorten the size of the file. If the file is not being >> Y * truncated to a block boundary, the contents of the >> Y * partial block following the end of the file must be >> Y * zero'ed in case it ever becomes accessible again because >> Y * of subsequent file growth. Directories however are not >> Y * zero'ed as they should grow back initialized to empty. >> Y */ >> Y offset = blkoff(fs, length); >> Y if (offset == 0) { >> Y ip->i_size = length; >> Y DIP_SET(ip, i_size, length); >> Y } else { >> Y lbn = lblkno(fs, length); >> Y flags |= BA_CLRBUF; >> Y error = UFS_BALLOC(vp, length - 1, 1, cred, flags, &bp); >> Y if (error) { >> Y return (error); >> Y } >> Y /* >> Y * When we are doing soft updates and the UFS_BALLOC >> Y * above fills in a direct block hole with a full sized >> Y * block that will be truncated down to a fragment below, >> Y * we must flush out the block dependency with an FSYNC >> Y * so that we do not get a soft updates inconsistency >> Y * when we create the fragment below. >> Y */ >> Y if (DOINGSOFTDEP(vp) && lbn < NDADDR && >> Y fragroundup(fs, blkoff(fs, length)) < fs->fs_bsize && >> Y (error = ffs_syncvnode(vp, MNT_WAIT)) != 0) >> Y return (error); >> Y ip->i_size = length; >> Y DIP_SET(ip, i_size, length); >> Y size = blksize(fs, ip, lbn); >> Y if (vp->v_type != VDIR) >> Y bzero((char *)bp->b_data + offset, >> Y (u_int)(size - offset)); >> Y /* Kirk's code has reallocbuf(bp, size, 1) here */ >> Y allocbuf(bp, size); >> Y if (bp->b_bufsize == fs->fs_bsize) >> Y bp->b_flags |= B_CLUSTEROK; >> Y if (flags & IO_SYNC) >> Y bwrite(bp); >> Y else >> Y bawrite(bp); FreeBSD-9+ already has my DOINGASYNC() fix here. However, an async write is still done when DOINGASYNC(). It is done by vtruncbuf() 50 lines after here. vtruncbuf() doesn't know about DOINGASYNC(). It turns delayed writes into unconditional async ones. >> Y } >> >> I think we usually arrive here and honor the IO_SYNC flag. This is correct. >> Otherwise, we always do an async write, but that is wrong for async mounts. 
>> Here is my old fix for this: >> >> Z diff -u2 ffs_inode.c~ ffs_inode.c >> Z --- ffs_inode.c~ Wed Apr 7 21:22:26 2004 >> Z +++ ffs_inode.c Sat Mar 23 01:23:16 2013 >> Z @@ -345,4 +431,6 @@ >> Z if (flags & IO_SYNC) >> Z bwrite(bp); >> Z + else if (DOINGASYNC(ovp)) >> Z + bdwrite(bp); >> Z else >> Z bawrite(bp); >> >> This fix must be sprinkled in most places where there is a bwrite()/ >> bawrite() decision. > No, I do not think that it would be correct for SU mounts. It is essential SU silently ignores the async mount flag (by killing it instead of ignoring it later), so the DOINGASYNC() checks don't affect it. > for the correct operation of e.g. ffs_indirtrunc() that writes for SU > case are synchronous, since no dependencies on the indirect block updates > are recorded. The fact that syncvnode() is done before is similarly > important, because no existing dependencies are cleared. > > On the other hand, I agree with the note that the final ffs_update() > must honour IO_SYNC requests. > > Anyway, my point was that your patch does not change the hardest source > of sync writes, only the write of the final block. I will commit the > following. Er, it fixes all cases of directory shrinking for async mounts. All cases should probably use watermarks and shrink at block or frag boundaries instead of 512-boundaries. E.g., for small directories, shrink if size - endoff >= fs_fsize && . With fs_fsize = 2K, this gives for example: - size <= 2K: never shrink - size nearly 4K but endoff between 1K and 2K: don't shrink, because shrinking would free a frag but not leave much space for expansion. > diff --git a/sys/ufs/ffs/ffs_inode.c b/sys/ufs/ffs/ffs_inode.c > index 0202820..50b456b 100644 > --- a/sys/ufs/ffs/ffs_inode.c > +++ b/sys/ufs/ffs/ffs_inode.c > @@ -610,7 +610,7 @@ extclean: > softdep_journal_freeblocks(ip, cred, length, IO_EXT); > else > softdep_setup_freeblocks(ip, length, IO_EXT); > - return (ffs_update(vp, !DOINGASYNC(vp))); > + return (ffs_update(vp, (flags & IO_SYNC) != 0 || !DOINGASYNC(vp))); > } > > /* Oops, this needs fixing in my version, but in -current the fix has little effect since in -current ffs_update() still dishonors the waitfor flag for its bwrite()/bdwrite() decision if DOINGASYNC(). This is essentially the same as dishonoring the IO_SYNC flag here. ffs_update() needs the same fix in 4 more places. > diff --git a/sys/ufs/ufs/ufs_lookup.c b/sys/ufs/ufs/ufs_lookup.c > index 43b4e5c..53536ff 100644 > --- a/sys/ufs/ufs/ufs_lookup.c > +++ b/sys/ufs/ufs/ufs_lookup.c > @@ -1131,7 +1131,7 @@ ufs_direnter(dvp, tvp, dirp, cnp, newdirbp, isrename) > if (tvp != NULL) > VOP_UNLOCK(tvp, 0); > error = UFS_TRUNCATE(dvp, (off_t)dp->i_endoff, > - IO_NORMAL | IO_SYNC, cr); > + IO_NORMAL | (DOINGASYNC(dvp) ? 0 : IO_SYNC), cr); > if (error != 0) > vprint("ufs_direnter: failed to truncate", dvp); > #ifdef UFS_DIRHASH > OK. I want this to avoid _any_ sync writes here for async mounts even after the excessive truncations are fixed. Perhaps vtruncbuf() should just check the async mount flag to avoid async writes (except possibly when buf_dirty_count_severe()).
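(For readers collating these patches: the bwrite()/bawrite() decision this thread keeps patching collapses to one three-way pattern. The helper below is only an illustrative sketch of that pattern; the name ffs_buf_write_policy() is invented here and is not part of any posted patch:)

/*
 * Sketch: the write-policy decision discussed above, in one place.
 * IO_SYNC wins; an async ("-o async") mount gets a delayed write;
 * everything else gets an immediate async write.
 */
static void
ffs_buf_write_policy(struct vnode *vp, struct buf *bp, int ioflag)
{

	if ((ioflag & IO_SYNC) != 0)
		(void)bwrite(bp);	/* caller required synchronous i/o */
	else if (DOINGASYNC(vp))
		bdwrite(bp);		/* async mount: delay the write */
	else
		bawrite(bp);		/* default: start an async write now */
}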
Bruce From owner-freebsd-fs@freebsd.org Tue May 17 19:28:10 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id B2E36B3EB1E for ; Tue, 17 May 2016 19:28:10 +0000 (UTC) (envelope-from lexa@lexa.ru) Received: from mx3.lexa.ru (ns503534.ip-198-27-68.net [198.27.68.102]) by mx1.freebsd.org (Postfix) with ESMTP id 95EB01383 for ; Tue, 17 May 2016 19:28:09 +0000 (UTC) (envelope-from lexa@lexa.ru) Received: by mx3.lexa.ru (Postfix, from userid 66) id 1F830224A61; Tue, 17 May 2016 15:28:08 -0400 (EDT) Received: from [193.124.130.166] (unknown [193.124.130.166]) by home-gw.lexa.ru (Postfix) with ESMTP id 38382CA5 for ; Tue, 17 May 2016 22:24:19 +0300 (MSK) Subject: Re: ZFS performance bottlenecks: CPU or RAM or anything else? References: <8441f4c0-f8d1-f540-b928-7ae60998ba8e@lexa.ru> <16e474da-6b20-2e51-9981-3c262eaff350@lexa.ru> <1e012e43-a49b-6923-3f0a-ee77a5c8fa70@lexa.ru> <86shxgsdzh.fsf@WorkBox.Home> To: freebsd-fs@freebsd.org From: Alex Tutubalin Message-ID: Date: Tue, 17 May 2016 22:24:19 +0300 User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:45.0) Gecko/20100101 Thunderbird/45.1.0 MIME-Version: 1.0 In-Reply-To: <86shxgsdzh.fsf@WorkBox.Home> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 May 2016 19:28:10 -0000 > If you're already planning for multiple simultaneous drive failures, > "economical" isn't really a factor, is it? Those disks have to get > replaced regardless of the redundancy scheme you assign to them. ;) I do not plan for failures, but there is always a chance of a bad drive model. I've survived the 3 TB Seagates without data loss; that would not have been possible in my case without hardware RAID6. > Speaking strictly about performance, RAIDZ performance is pretty much > fixed, Anyway, my thread-starting question is different: I see a great performance difference on the same pool connected to a different CPU/RAM combo. I do not know what caused this difference: CPU speed, RAM bandwidth, or RAM latency. Maybe someone on this list has benchmarked ZFS RAIDZ for performance and knows what the bottleneck is?
Alex Tutubalin From owner-freebsd-fs@freebsd.org Tue May 17 19:40:19 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id EC41CB3D061 for ; Tue, 17 May 2016 19:40:19 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mailman.ysv.freebsd.org (unknown [127.0.1.3]) by mx1.freebsd.org (Postfix) with ESMTP id D91BE1D0B for ; Tue, 17 May 2016 19:40:19 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: by mailman.ysv.freebsd.org (Postfix) id D4FC3B3D05F; Tue, 17 May 2016 19:40:19 +0000 (UTC) Delivered-To: fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id D2582B3D05E for ; Tue, 17 May 2016 19:40:19 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail105.syd.optusnet.com.au (mail105.syd.optusnet.com.au [211.29.132.249]) by mx1.freebsd.org (Postfix) with ESMTP id 8F5851D0A for ; Tue, 17 May 2016 19:40:19 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from c122-106-149-109.carlnfd1.nsw.optusnet.com.au (c122-106-149-109.carlnfd1.nsw.optusnet.com.au [122.106.149.109]) by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id 4E38C1046239; Wed, 18 May 2016 05:40:10 +1000 (AEST) Date: Wed, 18 May 2016 05:40:07 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Bruce Evans cc: Konstantin Belousov , fs@freebsd.org Subject: Re: quick fix for slow directory shrinking in ffs In-Reply-To: <20160518035413.L4357@besplex.bde.org> Message-ID: <20160518052656.R5764@besplex.bde.org> References: <20160517072705.F2157@besplex.bde.org> <20160517082050.GX89104@kib.kiev.ua> <20160517192933.U4573@besplex.bde.org> <20160517111715.GC89104@kib.kiev.ua> <20160518035413.L4357@besplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.1 cv=c+ZWOkJl c=1 sm=1 tr=0 a=R/f3m204ZbWUO/0rwPSMPw==:117 a=L9H7d07YOLsA:10 a=9cW_t1CCXrUA:10 a=s5jvgZ67dGcA:10 a=kj9zAlcOel0A:10 a=7sixqL4dHYFS359oKKQA:9 a=91Z7TcaMQPEi1bVU:21 a=sc9wKKxEHgg0U6Hn:21 a=CjuIK1q_8ugA:10 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 May 2016 19:40:20 -0000 On Wed, 18 May 2016, Bruce Evans wrote: > On Tue, 17 May 2016, Konstantin Belousov wrote: >> diff --git a/sys/ufs/ffs/ffs_inode.c b/sys/ufs/ffs/ffs_inode.c >> index 0202820..50b456b 100644 >> --- a/sys/ufs/ffs/ffs_inode.c >> +++ b/sys/ufs/ffs/ffs_inode.c >> @@ -610,7 +610,7 @@ extclean: >> softdep_journal_freeblocks(ip, cred, length, IO_EXT); >> else >> softdep_setup_freeblocks(ip, length, IO_EXT); >> - return (ffs_update(vp, !DOINGASYNC(vp))); >> + return (ffs_update(vp, (flags & IO_SYNC) != 0 || !DOINGASYNC(vp))); >> } >> >> /* > > Oops, this needs fixing in my version, but in -current the fix has > little effect since in -current ffs_update() still dishonors the waitfor > flag for its bwrite()/bdwrite() decision if DOINGASYNC(). This is > essentially the same as dishonoring the IO_SYNC flag here. > > ffs_update() needs the same fix in 4 more places. Also, ftruncate() seems to be broken. POSIX doesn't seem to require it to honor O_SYNC, but POLA requires this. 
But there is no VOP_TRUNCATE(); truncation is done using VOP_SETATTR() and there is no way to pass down the O_SYNC flag to it; in practice, ffs just does UFS_TRUNCATE() without IO_SYNC. This makes a difference mainly for async mounts with my fixes to honor IO_SYNC in ffs_update(). With async mounts, consistency of the file system is not guaranteed but O_SYNC for a file should at least cause all of the file data and most of its metadata to be written. Not syncing for ftruncate() unnecessarily loses metadata writes. With !async mounts, consistency of the file system is partly guaranteed and lost metadata writes for ftruncate() shouldn't affect this -- they should just lose the ftruncate() atomically. vfs could do an fsync() after VOP_SETATTR() for the O_SYNC case. This reduces the race window. Bruce From owner-freebsd-fs@freebsd.org Tue May 17 20:39:56 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 7D24CB3FAB6 for ; Tue, 17 May 2016 20:39:56 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mailman.ysv.freebsd.org (mailman.ysv.freebsd.org [IPv6:2001:1900:2254:206a::50:5]) by mx1.freebsd.org (Postfix) with ESMTP id 6B4E214EE for ; Tue, 17 May 2016 20:39:56 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: by mailman.ysv.freebsd.org (Postfix) id 6A8DDB3FAB5; Tue, 17 May 2016 20:39:56 +0000 (UTC) Delivered-To: fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 6A310B3FAB3 for ; Tue, 17 May 2016 20:39:56 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail104.syd.optusnet.com.au (mail104.syd.optusnet.com.au [211.29.132.246]) by mx1.freebsd.org (Postfix) with ESMTP id 37A8314ED for ; Tue, 17 May 2016 20:39:55 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from c122-106-149-109.carlnfd1.nsw.optusnet.com.au (c122-106-149-109.carlnfd1.nsw.optusnet.com.au [122.106.149.109]) by mail104.syd.optusnet.com.au (Postfix) with ESMTPS id AE2BA428F79; Wed, 18 May 2016 06:39:52 +1000 (AEST) Date: Wed, 18 May 2016 06:39:49 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Konstantin Belousov cc: fs@freebsd.org Subject: Re: fix for per-mount i/o counting in ffs In-Reply-To: <20160517084241.GY89104@kib.kiev.ua> Message-ID: <20160518061040.D5948@besplex.bde.org> References: <20160517072104.I2137@besplex.bde.org> <20160517084241.GY89104@kib.kiev.ua> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.1 cv=EfU1O6SC c=1 sm=1 tr=0 a=R/f3m204ZbWUO/0rwPSMPw==:117 a=L9H7d07YOLsA:10 a=9cW_t1CCXrUA:10 a=s5jvgZ67dGcA:10 a=kj9zAlcOel0A:10 a=QJha7pIZtpZdJjQaBnUA:9 a=CjuIK1q_8ugA:10 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 May 2016 20:39:56 -0000 On Tue, 17 May 2016, Konstantin Belousov wrote: > On Tue, May 17, 2016 at 07:26:08AM +1000, Bruce Evans wrote: >> Counting of i/o's in g_vfs_strategy() requires the fs to initialize >> devvp->v_rdev->si_mountpt to non-null. This seems to be done correctly >> in ext2fs and msdosfs, but in ffs it is not done for ro mounts, or for >> rw mounts that started as ro.
The bug is most obvious for the root >> file system since it always starts as ro. > > I committed the comments updates. > > For the accounting patch, don't we want to account for all io, including > the mount-time metadata reads and initial superblock update ? > > diff --git a/sys/ufs/ffs/ffs_vfsops.c b/sys/ufs/ffs/ffs_vfsops.c > index 9776554..712fc21 100644 > --- a/sys/ufs/ffs/ffs_vfsops.c > +++ b/sys/ufs/ffs/ffs_vfsops.c > @@ -780,6 +780,8 @@ ffs_mountfs(devvp, mp, td) > mp->mnt_iosize_max = MAXPHYS; > > devvp->v_bufobj.bo_ops = &ffs_ops; > + if (devvp->v_type == VCHR) > + devvp->v_rdev->si_mountpt = mp; > > fs = NULL; > sblockloc = 0; > @@ -1049,8 +1051,6 @@ ffs_mountfs(devvp, mp, td) > ffs_flushfiles(mp, FORCECLOSE, td); > goto out; > } > - if (devvp->v_type == VCHR && devvp->v_rdev != NULL) > - devvp->v_rdev->si_mountpt = mp; > if (fs->fs_snapinum[0] != 0) > ffs_snapshot_mount(mp); > fs->fs_fmod = 1; > @@ -1083,6 +1083,8 @@ ffs_mountfs(devvp, mp, td) > out: > if (bp) > brelse(bp); > + if (devvp->v_type == VCHR && devvp->v_rdev != NULL) > + devvp->v_rdev->si_mountpt = NULL; > if (cp != NULL) { > DROP_GIANT(); > g_topology_lock(); Yes, that looks better. The other file systems that support the counters (ext2fs and msdosfs) need a similar change. Grepping for si_mountpoint shows no other file systems that support this. The recently axed reiserfs sets si_mountpt, but only if si_mountpt is #defined. This only works in old versions: - in old versions, si_mountpt is #defined. GEOM broke this, and the #define was removed. The ifdef kept reiserfs compiling. History for reiserfs was broken by repo-copying after the ifdef was added. - mckusick fixed the counting for ffs and restored si_mountpt, but it is now not #define'd. The following file systems are something like ffs so they should set si_mountpt, but don't: cd9660, fuse, nandfs (?), udf, zfs. I only understand cd9660 and udf. The following file systems used to set si_mountpt but now don't: hpfs (axed), ntfs (axed), udf (but not cd9660). Counters for the ro file systems are only moderately useful. They tell you if the block size is too small and/or if the clustering is bad. No version seems to be as careful as the above -- they don't set si_mountpt until near the end of a successful mount. This takes just 1 statement in mount() and 1 in umount(). I'd like vfs to do this setting so that leaf file systems can't forget to do it, but never figured out the plumbing to tell upper layers of vfs about devvp. g_vfs_open() can't do it since it knows devvp but not mp.
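(To make the "1 statement in mount() and 1 in umount()" concrete: condensed from the patch quoted above, a leaf file system enables the per-mount counters roughly like this -- a sketch only, with locking and error paths omitted:)

/* In the fs mount path, once the device vnode devvp is open: */
if (devvp->v_type == VCHR)
	devvp->v_rdev->si_mountpt = mp;

/* In the fs unmount path, after i/o to the device has drained: */
if (devvp->v_type == VCHR && devvp->v_rdev != NULL)
	devvp->v_rdev->si_mountpt = NULL;

g_vfs_strategy() then charges each buf against the mnt_stat counters of the mount hanging off si_mountpt, which is where the per-mount sync/async read and write counts come from.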
Bruce From owner-freebsd-fs@freebsd.org Tue May 17 21:11:25 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 8A665B3F2D6 for ; Tue, 17 May 2016 21:11:25 +0000 (UTC) (envelope-from steven@multiplay.co.uk) Received: from mail-wm0-x22f.google.com (mail-wm0-x22f.google.com [IPv6:2a00:1450:400c:c09::22f]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 15BB019FD for ; Tue, 17 May 2016 21:11:24 +0000 (UTC) (envelope-from steven@multiplay.co.uk) Received: by mail-wm0-x22f.google.com with SMTP id g17so50917814wme.1 for ; Tue, 17 May 2016 14:11:24 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=multiplay-co-uk.20150623.gappssmtp.com; s=20150623; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc; bh=18HaE95cOAcNXvLcgKRPFkuzI+tn9EgBFp0adEt+bLM=; b=YS/lDuk7Jwu3N2EsuJ4/VFu60UCK20GpgHV9LTqiw/AhwGwJnbNGW9Q7eTpFxbVHyy tV+tzTNYJGG1duBa2R3QQunJwgP2WSytUmWSjLyt2KTHbbzsxEYZbPFaicnFbhTBB7C2 0Nxxz7JTgyLzAZyO382/5cP9IkAmutNg3fhj3yZZxbCzgUDouUj0esrPCK08pmiwLY+x 7iCgd73li5/H8I1+Unr8NvSdgyq6Z8v9hcnqGDteu8I3Y3eDy6mPAJFhVYkDLb/a5mMO fDxKDKeY+iqe+EIII7HiOf5WIlF+T3QXuWACDuqgqkVwyOLhs/ZA3AQ1LRgXkALZ7Yv8 2sqw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:date :message-id:subject:from:to:cc; bh=18HaE95cOAcNXvLcgKRPFkuzI+tn9EgBFp0adEt+bLM=; b=Kcg7Rql9vGP9LQwwtuPWcSrS7lu0RIwnJacIrH54jUmoMZ2pn4sVfcBCV7p+oUt0rX v2cvzaxH4KW2iqKe5oqDbcGK92AZSyhvTqsHCKRr+ZHvqQVq8oCZhVYmEC7wYtzTe3E3 3knms84pJ2IYFJN6LD3Ty2EyWyjtCplnX+3mDQ+Ny/weFiNfuKCEHq8zx7z2dI+Gnk8r Gd0OC6liQaLkGTDcSxJEpKmzHFv1VMfx/VKiFQg0K6vio6tkJBDQ/UrqxO4lOPVHzsQ4 4VZqGS9X3BVXKU9di1YJ2NRAJJIvQwuZwAvbPcW5K26DlYQofPCBMKP2AKZRCJTEG2c5 4Qkg== X-Gm-Message-State: AOPr4FWICj2RBIai881B0XxcNPvt4doGJEtSoW3ooHJOIkRWBzOQx+cD+Njx8p8i7xVHRFHA6HPopTv9Tr/IYlwa MIME-Version: 1.0 X-Received: by 10.194.139.104 with SMTP id qx8mr3425725wjb.14.1463519483422; Tue, 17 May 2016 14:11:23 -0700 (PDT) Received: by 10.28.93.203 with HTTP; Tue, 17 May 2016 14:11:23 -0700 (PDT) In-Reply-To: <86shxgsdzh.fsf@WorkBox.Home> References: <8441f4c0-f8d1-f540-b928-7ae60998ba8e@lexa.ru> <16e474da-6b20-2e51-9981-3c262eaff350@lexa.ru> <1e012e43-a49b-6923-3f0a-ee77a5c8fa70@lexa.ru> <86shxgsdzh.fsf@WorkBox.Home> Date: Tue, 17 May 2016 22:11:23 +0100 Message-ID: Subject: Re: ZFS performance bottlenecks: CPU or RAM or anything else? From: Steven Hartland To: "Brandon J. Wandersee" Cc: Alex Tutubalin , "freebsd-fs@freebsd.org" Content-Type: text/plain; charset=UTF-8 X-Content-Filtered-By: Mailman/MimeDel 2.1.22 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 May 2016 21:11:25 -0000 Raidz is essentially limited to a single drive's performance per vdev for read and write, while a mirror is a single drive's performance for write and the number of drives for read. Don't forget a mirror is not limited to two drives; it can be three, four or more, so if you need more read throughput you can add drives to the mirror. To increase raidz performance you need to add more vdevs. While this doesn't have to be double, i.e. the same vdev config as the first, it is generally a good idea. Don't forget that while it rebalances, write performance of a multi-vdev raidz will be limited to the added vdev. On Tuesday, 17 May 2016, Brandon J. Wandersee wrote: > > Alex Tutubalin writes: > > > On 5/17/2016 3:29 PM, Daniel Kalchev wrote: > > > >> Not true. You can have N-way mirror and it will survive N-1 drive > failures. > > I agree, but 3-way mirror does not look economical compared to raidz2. > > If you're already planning for multiple simultaneous drive failures, > "economical" isn't really a factor, is it? Those disks have to get > replaced regardless of the redundancy scheme you assign to them. ;) > > Whether the concern is performance or capacity, mirrors will offer the > most flexibility. Increasing either the performance or capacity of a > RAIDZ pool necessitates either replacing every disk in the pool or > doubling the number of disks in the pool, all at once. Mirrors allow you > to grow a pool and increase/decrease redundancy asymmetrically. True, > four disks in a two-mirror stripe will see you restoring a backup if one > disk from each mirror dies, but (arguably) six disks in a two-mirror > stripe offer both better redundancy and better performance. > > Speaking strictly about performance, RAIDZ performance is pretty much > fixed, while mirrored performance will (I believe) increase slightly as > you add disks and increase greatly as you add vdevs. > > -- > > :: Brandon J. Wandersee > :: brandon.wandersee@gmail.com > :: -------------------------------------------------- > :: 'The best design is as little design as possible.' > :: --- Dieter Rams ---------------------------------- > _______________________________________________ > freebsd-fs@freebsd.org mailing list > https://lists.freebsd.org/mailman/listinfo/freebsd-fs > To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org > " > From owner-freebsd-fs@freebsd.org Tue May 17 21:16:17 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 75680B3F456 for ; Tue, 17 May 2016 21:16:17 +0000 (UTC) (envelope-from fjwcash@gmail.com) Received: from mail-io0-x232.google.com (mail-io0-x232.google.com [IPv6:2607:f8b0:4001:c06::232]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 3D6FA1CE1 for ; Tue, 17 May 2016 21:16:17 +0000 (UTC) (envelope-from fjwcash@gmail.com) Received: by mail-io0-x232.google.com with SMTP id f89so40661181ioi.0 for ; Tue, 17 May 2016 14:16:17 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc; bh=kkj1+UCBiYndbNrtv0GY52CycSb/yBIhTS5KmeN8Hno=; b=TIZhrcngAZYRJ6cGCDKupKT/vFj3s9R7uB56jomdtKCUgC8dGe9L9fuGF2VnedX2DT wEkocW6M/+DJKGNg/rVIw9PqRUd8F/mg38bDt/a5Q11S8I5XgfV19N0e8fcKu8V10jsz OQ0lZX5iLn7en0jcVVrS05F9n3TiMm+SCtn34NaAuFXCXiy7O6+LT5/S9FMHX/j5ynG8 FuIyEFLc7ZP5dr/8ACVQvRcbkdAPy3XwJ1uBrBfQWtzEMjmpO+D2jBWugLYlEmPiM91x OlZbsIgV6Q/SHwOPlkZQVvX/oj7lCpeqFCFt5czoS9wV82wmIo2fpA+9XCdw+hZCciWn r1OA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:date :message-id:subject:from:to:cc; bh=kkj1+UCBiYndbNrtv0GY52CycSb/yBIhTS5KmeN8Hno=;
b=WDiT5D60qP/ukDGmfVfiADk6Plcxkycc12455MsACSU09zQY7iY2PRFUnLbJTR87VF 1C5L+bk6jvn/nSEk9OoRiiSF4mgpubyaeKGfOd7LOOoYNvoc/qBFgT08UV2i5Zlyvm4j AVvvE0mpg31Vt5lrBRA1/SN086q4WFsbiV/+0ujb0f00CtbwKQKcpy+JHoZquIqbE1El w40q2jBAOxBt/0fMTFiKWby58/OF6VZBA5xMaoc0AWNyLSt08jrsjo2UQDnZA9obGAUk jr/Qz5FtQLpLeJexXvZHkY6IB8ZHrTV9z4NJDhvV5xvOT2AAVlLDeugaW8gCrdHiYFps SRAw== X-Gm-Message-State: AOPr4FWVagn6ObPY5Fj7nNlVx/yeZbQ7/o+V7bKWLypXloDUY9PTizUyECHbFAh8C9LJreIsVLie13BjM8lmGA== MIME-Version: 1.0 X-Received: by 10.107.134.24 with SMTP id i24mr2716422iod.130.1463519776605; Tue, 17 May 2016 14:16:16 -0700 (PDT) Received: by 10.107.173.79 with HTTP; Tue, 17 May 2016 14:16:16 -0700 (PDT) In-Reply-To: References: <8441f4c0-f8d1-f540-b928-7ae60998ba8e@lexa.ru> <16e474da-6b20-2e51-9981-3c262eaff350@lexa.ru> <1e012e43-a49b-6923-3f0a-ee77a5c8fa70@lexa.ru> <86shxgsdzh.fsf@WorkBox.Home> Date: Tue, 17 May 2016 14:16:16 -0700 Message-ID: Subject: Re: ZFS performance bottlenecks: CPU or RAM or anything else? From: Freddie Cash To: Steven Hartland Cc: "Brandon J. Wandersee" , "freebsd-fs@freebsd.org" Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable X-Content-Filtered-By: Mailman/MimeDel 2.1.22 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 May 2016 21:16:17 -0000 On Tue, May 17, 2016 at 2:11 PM, Steven Hartland wrote: > Raidz is essentially limited to a single drive's performance > per vdev for read and write, while a mirror is a single drive's performance for > write and the number of drives for read. Don't forget a mirror is not limited to > two drives; it can be three, four or more, so if you need more read throughput you > can add drives to the mirror. > > To increase raidz performance you need to add more vdevs. While this > doesn't have to be double, i.e. the same vdev config as the first, it > is generally a good idea. > > Don't forget that while it rebalances, write performance of a multi-vdev > raidz will be limited to the added vdev. > Everybody is missing the point of the OP. They're not asking for ways to improve the performance of a raidz-based pool; they're asking why they get different performance metrics from the exact same pool when they change the CPU and RAM. And, more importantly, why a Core-i3-based system shows better performance than a Core-i7-based system. Is there something inherent to the way ZFS works that favours one setup over another (lower CPU core counts running at higher speeds is better/worse than higher CPU core counts running at lower speeds; more RAM channels is better/worse; things like that).
-- Freddie Cash fjwcash@gmail.com From owner-freebsd-fs@freebsd.org Tue May 17 21:22:37 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id B74E4B3F62F for ; Tue, 17 May 2016 21:22:37 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from mailman.ysv.freebsd.org (unknown [127.0.1.3]) by mx1.freebsd.org (Postfix) with ESMTP id A06B21099 for ; Tue, 17 May 2016 21:22:37 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: by mailman.ysv.freebsd.org (Postfix) id 9BDD0B3F62C; Tue, 17 May 2016 21:22:37 +0000 (UTC) Delivered-To: fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 9B85DB3F62B for ; Tue, 17 May 2016 21:22:37 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 4560C1098 for ; Tue, 17 May 2016 21:22:37 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from tom.home (kib@localhost [127.0.0.1]) by kib.kiev.ua (8.15.2/8.15.2) with ESMTPS id u4HLMRFu006554 (version=TLSv1 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO); Wed, 18 May 2016 00:22:27 +0300 (EEST) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.10.3 kib.kiev.ua u4HLMRFu006554 Received: (from kostik@localhost) by tom.home (8.15.2/8.15.2/Submit) id u4HLMR6Z006553; Wed, 18 May 2016 00:22:27 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Wed, 18 May 2016 00:22:27 +0300 From: Konstantin Belousov To: Bruce Evans Cc: fs@freebsd.org Subject: Re: quick fix for slow directory shrinking in ffs Message-ID: <20160517212227.GE89104@kib.kiev.ua> References: <20160517072705.F2157@besplex.bde.org> <20160517082050.GX89104@kib.kiev.ua> <20160517192933.U4573@besplex.bde.org> <20160517111715.GC89104@kib.kiev.ua> <20160518035413.L4357@besplex.bde.org> <20160518052656.R5764@besplex.bde.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160518052656.R5764@besplex.bde.org> User-Agent: Mutt/1.6.1 (2016-04-27) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no autolearn_force=no version=3.4.1 X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on tom.home X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 May 2016 21:22:37 -0000 On Wed, May 18, 2016 at 05:40:07AM +1000, Bruce Evans wrote: > Also, ftruncate() seems to be broken. POSIX doesn't seem to require it > to honor O_SYNC, but POLA requires this. But there is no VOP_TRUNCATE(); > truncation is done using VOP_SETATTR() and there is no way to pass down > the O_SYNC flag to it; in practice, ffs just does UFS_TRUNCATE() without > IO_SYNC. > > This makes a difference mainly for async mounts with my fixes to honor > IO_SYNC in ffs_update().
With async mounts, consistency of the file > system is not guaranteed but O_SYNC for a file should at least cause > all of the file data and most of its metadata to be written. Not syncing > for ftruncate() unnecessarily loses metadata writes. With !async mounts, > consistency of the file system is partly guaranteed and lost metadata > writes for ftruncate() shouldn't affect this -- they should just lose > the ftruncate() atomically. > > vfs could do an fsync() after VOP_SETATTR() for the O_SYNC case. This > reduces the race window. vattr already has the va_vaflags field. It is trivial to add a flag there requesting O_SYNC behaviour. Of course, other updates could also honour VA_SYNC, but this is for later. Like this: diff --git a/sys/kern/vfs_vnops.c b/sys/kern/vfs_vnops.c index 0a3a88a..1e42a3d 100644 --- a/sys/kern/vfs_vnops.c +++ b/sys/kern/vfs_vnops.c @@ -1314,6 +1314,8 @@ vn_truncate(struct file *fp, off_t length, struct ucred *active_cred, if (error == 0) { VATTR_NULL(&vattr); vattr.va_size = length; + if ((fp->f_flag & O_FSYNC) != 0) + vattr.va_vaflags |= VA_SYNC; error = VOP_SETATTR(vp, &vattr, fp->f_cred); } out: diff --git a/sys/sys/vnode.h b/sys/sys/vnode.h index e82f6ee..41ec7f7 100644 --- a/sys/sys/vnode.h +++ b/sys/sys/vnode.h @@ -286,6 +286,7 @@ struct vattr { */ #define VA_UTIMES_NULL 0x01 /* utimes argument was NULL */ #define VA_EXCLUSIVE 0x02 /* exclusive create request */ +#define VA_SYNC 0x04 /* O_SYNC truncation */ /* * Flags for ioflag. (high 16 bits used to ask for read-ahead and diff --git a/sys/ufs/ufs/ufs_vnops.c b/sys/ufs/ufs/ufs_vnops.c index c0729f8..83df347 100644 --- a/sys/ufs/ufs/ufs_vnops.c +++ b/sys/ufs/ufs/ufs_vnops.c @@ -625,7 +625,8 @@ ufs_setattr(ap) */ return (0); } - if ((error = UFS_TRUNCATE(vp, vap->va_size, IO_NORMAL, + if ((error = UFS_TRUNCATE(vp, vap->va_size, IO_NORMAL | + ((vap->va_vaflags & VA_SYNC) != 0 ?
IO_SYNC : 0), cred)) != 0) return (error); } From owner-freebsd-fs@freebsd.org Tue May 17 21:28:24 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id EA9A2B3F906 for ; Tue, 17 May 2016 21:28:24 +0000 (UTC) (envelope-from steven@multiplay.co.uk) Received: from mail-wm0-x232.google.com (mail-wm0-x232.google.com [IPv6:2a00:1450:400c:c09::232]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id A664C1A2B for ; Tue, 17 May 2016 21:28:24 +0000 (UTC) (envelope-from steven@multiplay.co.uk) Received: by mail-wm0-x232.google.com with SMTP id r12so7433440wme.0 for ; Tue, 17 May 2016 14:28:24 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=multiplay-co-uk.20150623.gappssmtp.com; s=20150623; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc; bh=z7vstwTQiUNHIdnrUIAx7ViqJDOqm6/6JMO4H7a+tYQ=; b=AAdvZCT1eg2AgFpWM6/1yLfVKmwRGPAIyLDV5vppFlDbpsypakG/Qc5YyjGBOfAm2O 4+bjR82+v77Q7ixDolp0+cL6ho0HEsZS6/BKnbSGDUmBsqIPw4YntSWJy9trTAdEzKlc lsbef3u6bpNcD8IEQ4Q1SOKv/kGvLH4Y5cb4bYHO4TVHovSrKHoNhn8/5yXgU1XE5A56 BZU4hMmn7EpiUMmd4vOivHHk4p71mWdiGalhpHHt4LBnVnA/rUmIsNmKLc5O8pD1Jg3C /2oY1o2CH2p/55NNLr2QsAFqtERLuGYWHPeRbAd7FKYqgMcd74aMSDiV1bliFTfN+TIZ P5fA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:date :message-id:subject:from:to:cc; bh=z7vstwTQiUNHIdnrUIAx7ViqJDOqm6/6JMO4H7a+tYQ=; b=ipZR72+TlMbikBQ1tGMYBJayjW+HUGrMYaxP8QUxh3jW3T++2IlmNhEc5VM5jS8/Zo wuRpxGaTUx+n43O49oSs2oSa3gVPKcACRm72vfL2+LhYQ3eDrN9efFu2z3VU2Xen+olo /aabXnJPjJfK1VNTURQ7MF3VDL/a4/yxfsy8Xlgn7+C+vmb5e5PCN0hYT928KUML6h8L uksf9UhlytvdgYcL7DuOQILmUrFknfa3dvSOiCP16iW9vn69+pD4zt5Vu3oMlWsIEztJ A0iRrZd+xOda60CrPOjUTiD6H9Q4h3uev+IxpGfbXrjZweWmICfFdNOEuuJgzGANi4rF xsvg== X-Gm-Message-State: AOPr4FUKSlKvD0j2neQYKuun6V3pti+KkHKGGHvuZ9B41tvnO9MuNwfxnSL6RuMZ3OnulrUDY/+kfEEnDoauHU4i MIME-Version: 1.0 X-Received: by 10.194.163.229 with SMTP id yl5mr3582806wjb.6.1463520503074; Tue, 17 May 2016 14:28:23 -0700 (PDT) Received: by 10.28.93.203 with HTTP; Tue, 17 May 2016 14:28:22 -0700 (PDT) In-Reply-To: References: <8441f4c0-f8d1-f540-b928-7ae60998ba8e@lexa.ru> <16e474da-6b20-2e51-9981-3c262eaff350@lexa.ru> <1e012e43-a49b-6923-3f0a-ee77a5c8fa70@lexa.ru> <86shxgsdzh.fsf@WorkBox.Home> Date: Tue, 17 May 2016 22:28:22 +0100 Message-ID: Subject: Re: ZFS performance bottlenecks: CPU or RAM or anything else? From: Steven Hartland To: Freddie Cash Cc: "Brandon J. Wandersee" , "freebsd-fs@freebsd.org" Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable X-Content-Filtered-By: Mailman/MimeDel 2.1.22 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 May 2016 21:28:25 -0000 Tbh if the results were from more than 6 months ago they are likely quite out of date as things have changed quite significantly in that period, so retesting would be advised. 
On Tuesday, 17 May 2016, Freddie Cash wrote: > On Tue, May 17, 2016 at 2:11 PM, Steven Hartland > wrote: > >> Raidz is essentially limited to a single drive's performance >> per vdev for read and write, while a mirror is a single drive's performance for >> write and the number of drives for read. Don't forget a mirror is not limited to >> two drives; it can be three, four or more, so if you need more read throughput you >> can add drives to the mirror. >> >> To increase raidz performance you need to add more vdevs. While this >> doesn't have to be double, i.e. the same vdev config as the first, it >> is generally a good idea. >> >> Don't forget that while it rebalances, write performance of a multi-vdev >> raidz will be limited to the added vdev. >> > > Everybody is missing the point of the OP. > > They're not asking for ways to improve the performance of a raidz-based > pool; they're asking why they get different performance metrics from the > exact same pool when they change the CPU and RAM. > > And, more importantly, why a Core-i3-based system shows better performance > than a Core-i7-based system. Is there something inherent to the way ZFS > works that favours one setup over another (lower CPU core counts running at > higher speeds is better/worse than higher CPU core counts running at lower > speeds; more RAM channels is better/worse; things like that). > > > -- > Freddie Cash > fjwcash@gmail.com > From owner-freebsd-fs@freebsd.org Tue May 17 21:30:31 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 82A85B3F9D9 for ; Tue, 17 May 2016 21:30:31 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mailman.ysv.freebsd.org (unknown [127.0.1.3]) by mx1.freebsd.org (Postfix) with ESMTP id 70B661C8C for ; Tue, 17 May 2016 21:30:31 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: by mailman.ysv.freebsd.org (Postfix) id 6C66FB3F9D7; Tue, 17 May 2016 21:30:31 +0000 (UTC) Delivered-To: fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 6C128B3F9D6 for ; Tue, 17 May 2016 21:30:31 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail104.syd.optusnet.com.au (mail104.syd.optusnet.com.au [211.29.132.246]) by mx1.freebsd.org (Postfix) with ESMTP id 37D5D1C8B for ; Tue, 17 May 2016 21:30:30 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from c122-106-149-109.carlnfd1.nsw.optusnet.com.au (c122-106-149-109.carlnfd1.nsw.optusnet.com.au [122.106.149.109]) by mail104.syd.optusnet.com.au (Postfix) with ESMTPS id BA0F3429C03; Wed, 18 May 2016 07:30:28 +1000 (AEST) Date: Wed, 18 May 2016 07:30:25 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Bruce Evans cc: Konstantin Belousov , fs@freebsd.org Subject: Re: fix for per-mount i/o counting in ffs In-Reply-To: <20160518061040.D5948@besplex.bde.org> Message-ID: <20160518070252.F6121@besplex.bde.org> References: <20160517072104.I2137@besplex.bde.org> <20160517084241.GY89104@kib.kiev.ua> <20160518061040.D5948@besplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.1 cv=EfU1O6SC c=1 sm=1 tr=0 a=R/f3m204ZbWUO/0rwPSMPw==:117 a=L9H7d07YOLsA:10 a=9cW_t1CCXrUA:10 a=s5jvgZ67dGcA:10 a=kj9zAlcOel0A:10 a=JdS-s63oAcdDXkgrDngA:9 a=CjuIK1q_8ugA:10 X-BeenThere:
freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 May 2016 21:30:31 -0000 On Wed, 18 May 2016, Bruce Evans wrote: > On Tue, 17 May 2016, Konstantin Belousov wrote: >> ... >> For the accounting patch, don't we want to account for all io, including >> the mount-time metadata reads and initial superblock update ? >> >> diff --git a/sys/ufs/ffs/ffs_vfsops.c b/sys/ufs/ffs/ffs_vfsops.c >> index 9776554..712fc21 100644 >> --- a/sys/ufs/ffs/ffs_vfsops.c >> +++ b/sys/ufs/ffs/ffs_vfsops.c >> @@ -780,6 +780,8 @@ ffs_mountfs(devvp, mp, td) >> mp->mnt_iosize_max = MAXPHYS; >> >> devvp->v_bufobj.bo_ops = &ffs_ops; >> + if (devvp->v_type == VCHR) >> + devvp->v_rdev->si_mountpt = mp; >> >> fs = NULL; >> sblockloc = 0; >> @@ -1049,8 +1051,6 @@ ffs_mountfs(devvp, mp, td) >> ffs_flushfiles(mp, FORCECLOSE, td); >> goto out; >> } >> - if (devvp->v_type == VCHR && devvp->v_rdev != NULL) >> - devvp->v_rdev->si_mountpt = mp; >> if (fs->fs_snapinum[0] != 0) >> ffs_snapshot_mount(mp); >> fs->fs_fmod = 1; >> @@ -1083,6 +1083,8 @@ ffs_mountfs(devvp, mp, td) >> out: >> if (bp) >> brelse(bp); >> + if (devvp->v_type == VCHR && devvp->v_rdev != NULL) >> + devvp->v_rdev->si_mountpt = NULL; >> if (cp != NULL) { >> DROP_GIANT(); >> g_topology_lock(); > > Yes, that looks better. Further cleanups: - the null pointer check is bogus since we already dereferenced devvp->v_rdev. We also assigned devvp->v_rdev to the variable dev but spelled out devvp->v_rdev in a couple of other places. - the VCHR check is bogus since we only work for VCHR and have already checked for VCHR in vn_isdisk(). Similarly in ffs_umount() except there is no dev variable there. Similarly in msdosfs. NOT similarly in ext2fs. I was looking at the wrong tree again. Only 1 of my trees has the patch to do this in ext2fs. The patch for ffs applies almost verbatim. Bruce From owner-freebsd-fs@freebsd.org Tue May 17 21:35:58 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id A3D43B3FBEA for ; Tue, 17 May 2016 21:35:58 +0000 (UTC) (envelope-from fullermd@over-yonder.net) Received: from mail.infocus-llc.com (mail.infocus-llc.com [199.15.120.13]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 8249F125C for ; Tue, 17 May 2016 21:35:58 +0000 (UTC) (envelope-from fullermd@over-yonder.net) Received: from draco.over-yonder.net (c-75-65-60-66.hsd1.ms.comcast.net [75.65.60.66]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.tarragon.infocus-llc.com (Postfix) with ESMTPSA id 3r8Vxf4z8LzX2; Tue, 17 May 2016 16:35:50 -0500 (CDT) Received: by draco.over-yonder.net (Postfix, from userid 100) id 3r8Vxd6xw5z1mp; Tue, 17 May 2016 16:35:49 -0500 (CDT) Date: Tue, 17 May 2016 16:35:49 -0500 From: "Matthew D. Fuller" To: Freddie Cash Cc: Steven Hartland , "freebsd-fs@freebsd.org" Subject: Re: ZFS performance bottlenecks: CPU or RAM or anything else? 
Message-ID: <20160517213549.GK24656@over-yonder.net> References: <8441f4c0-f8d1-f540-b928-7ae60998ba8e@lexa.ru> <16e474da-6b20-2e51-9981-3c262eaff350@lexa.ru> <1e012e43-a49b-6923-3f0a-ee77a5c8fa70@lexa.ru> <86shxgsdzh.fsf@WorkBox.Home> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: X-Editor: vi X-OS: FreeBSD X-Virus-Scanned: clamav-milter 0.99 at mail.tarragon.infocus-llc.com X-Virus-Status: Clean User-Agent: Mutt/1.6.0-fullermd.4 (2016-04-01) X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 May 2016 21:35:58 -0000 On Tue, May 17, 2016 at 02:16:16PM -0700 I heard the voice of Freddie Cash, and lo! it spake thus: > > They're not asking for ways to improve the performance of a > raidz-based pool; they're asking why they get different performance > metrics from the exact same pool when they change the CPU and RAM. More specifically, as I read it, different performance in a very specific metric; single-thread linear bulk writes. That doesn't seem like it would benefit heavily from a lot of cores available, or from RAM bandwidth or size above a pretty low threshold. Of course, it's not just changing the CPU and RAM; it's also the motherboard, and possibly the HBA (at least the bus the HBA is on, if it's a card being transplanted with the pool). And the Core 2 would be back in the plain-old FSB era, so RAM access would be competing with the disk IO on the bus. -- Matthew Fuller (MF4839) | fullermd@over-yonder.net Systems/Network Administrator | http://www.over-yonder.net/~fullermd/ On the Internet, nobody can hear you scream. From owner-freebsd-fs@freebsd.org Tue May 17 21:46:47 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 5BD93B3FEA0 for ; Tue, 17 May 2016 21:46:47 +0000 (UTC) (envelope-from steven@multiplay.co.uk) Received: from mail-wm0-x22e.google.com (mail-wm0-x22e.google.com [IPv6:2a00:1450:400c:c09::22e]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id DF2561777 for ; Tue, 17 May 2016 21:46:46 +0000 (UTC) (envelope-from steven@multiplay.co.uk) Received: by mail-wm0-x22e.google.com with SMTP id n129so158118761wmn.1 for ; Tue, 17 May 2016 14:46:46 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=multiplay-co-uk.20150623.gappssmtp.com; s=20150623; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :cc; bh=HbfemyyWjoDRZ6770FQPy4PY02rYdxvUX21JKyZnTvE=; b=GBmlXcYH54XMqkc88djwby+crSG/B+62kv1kFDresvyb3AzZN1b/qMKSr2uMZvJwJL YxlHEU5fq/egqQaK4yZwyWW57eJ2hC38kNUnbwIacDwv9STRj5rB+Iphof+Ql2PmhuHK UCLEsyYuHSJ7T/PlRq0ZPhltu+6nTXKtIzMVgfCgticqe3bcLPPPXQKqgCTM7YN/UPbA tQyOxD27jAaJj3meff9faAi7qXiG5uz/GHhKJPNRE6u5GkmqO6OdaI8S9LLiZidaV7rH 4ZZAJKg0fn0+mNTLl67hjlJhQgRyesYoBRaSX0UccYrhVAXhqP29TnJchb0gyVN5iccV gKBA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:date :message-id:subject:from:to:cc; bh=HbfemyyWjoDRZ6770FQPy4PY02rYdxvUX21JKyZnTvE=; b=bfRb0nqGwUUwi9aXA3Sofo5GDIdn0g41b9mziHbUtvUxe7BexUDfMKHVBHKLfqUCFB 
qXjptfESSGHB3/0ULr810pUVMWVeBREmafHv4GR3nIsSdNjXrRlRHMNgVFuuGhcJgs49 bihNITJyLa2VKa1vC/DZmRjtorDBrhep0a5gTk+3fh0SXqj+74JrnY86N/7Es8vEOFjz +6R8f/oZBqYfcZ0ZT++lGcnxu0yOCP9u7aZx0EIB/xDJmtZ1b6Dx5kpdipK6ccQcgeAD tMsDbYe60NLlp048EtRsigzYYfd+puL6c3CaR9gntwUslX6ZdRU3nsa5IxPvsmT2S21a Yw8g== X-Gm-Message-State: AOPr4FVmJMTjRjVUon74W+pMIx7aq310aFzp1+2lgvgS7RJSJEb0/Nns/ljgHHFOwJvsrwTcv47K3mzSx1s0KCHv MIME-Version: 1.0 X-Received: by 10.28.6.138 with SMTP id 132mr25022114wmg.60.1463521605006; Tue, 17 May 2016 14:46:45 -0700 (PDT) Received: by 10.28.93.203 with HTTP; Tue, 17 May 2016 14:46:44 -0700 (PDT) In-Reply-To: <20160517213549.GK24656@over-yonder.net> References: <8441f4c0-f8d1-f540-b928-7ae60998ba8e@lexa.ru> <16e474da-6b20-2e51-9981-3c262eaff350@lexa.ru> <1e012e43-a49b-6923-3f0a-ee77a5c8fa70@lexa.ru> <86shxgsdzh.fsf@WorkBox.Home> <20160517213549.GK24656@over-yonder.net> Date: Tue, 17 May 2016 22:46:44 +0100 Message-ID: Subject: Re: ZFS performance bottlenecks: CPU or RAM or anything else? From: Steven Hartland To: "Matthew D. Fuller" Cc: Freddie Cash , "freebsd-fs@freebsd.org" Content-Type: text/plain; charset=UTF-8 X-Content-Filtered-By: Mailman/MimeDel 2.1.22 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 May 2016 21:46:47 -0000 Not to mention it's so easy to cripple performance with a bad BIOS setup that this could easily be a simple setup issue. I had an issue the other day where a 4 GHz Intel CPU couldn't transcode video in real time, which turned out to be a power-saving option in the BIOS that was utterly destroying performance by running the CPU at 800 MHz instead of 4 GHz. Everything else seemed fine; nothing was using more than a few percent of CPU. Disabling power saving fixed the issue. This issue was not present on a much lower power / older box simply because it didn't have the advanced power saving options. I'm not saying this was the case in these tests but simply providing a concrete example that it's sometimes hard to get like-for-like comparisons even for what should be simple tests. On Tuesday, 17 May 2016, Matthew D. Fuller wrote: > On Tue, May 17, 2016 at 02:16:16PM -0700 I heard the voice of > Freddie Cash, and lo! it spake thus: > > > > They're not asking for ways to improve the performance of a > > raidz-based pool; they're asking why they get different performance > > metrics from the exact same pool when they change the CPU and RAM. > > More specifically, as I read it, different performance in a very > specific metric; single-thread linear bulk writes. That doesn't seem > like it would benefit heavily from a lot of cores available, or from > RAM bandwidth or size above a pretty low threshold. > > Of course, it's not just changing the CPU and RAM; it's also the > motherboard, and possibly the HBA (at least the bus the HBA is on, if > it's a card being transplanted with the pool). And the Core 2 would > be back in the plain-old FSB era, so RAM access would be competing > with the disk IO on the bus. > > > -- > Matthew Fuller (MF4839) | fullermd@over-yonder.net > Systems/Network Administrator | http://www.over-yonder.net/~fullermd/ > On the Internet, nobody can hear you scream.
> From owner-freebsd-fs@freebsd.org Tue May 17 22:01:05 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id DAE69B401E6 for ; Tue, 17 May 2016 22:01:05 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from mailman.ysv.freebsd.org (unknown [127.0.1.3]) by mx1.freebsd.org (Postfix) with ESMTP id C4AD21D50 for ; Tue, 17 May 2016 22:01:05 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: by mailman.ysv.freebsd.org (Postfix) id C3FEAB401E5; Tue, 17 May 2016 22:01:05 +0000 (UTC) Delivered-To: fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id C39D5B401E4 for ; Tue, 17 May 2016 22:01:05 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 6D5F21D4F for ; Tue, 17 May 2016 22:01:05 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from tom.home (kib@localhost [127.0.0.1]) by kib.kiev.ua (8.15.2/8.15.2) with ESMTPS id u4HM0tdA016113 (version=TLSv1 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO); Wed, 18 May 2016 01:00:56 +0300 (EEST) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.10.3 kib.kiev.ua u4HM0tdA016113 Received: (from kostik@localhost) by tom.home (8.15.2/8.15.2/Submit) id u4HM0t7E016110; Wed, 18 May 2016 01:00:55 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Wed, 18 May 2016 01:00:55 +0300 From: Konstantin Belousov To: Bruce Evans Cc: fs@freebsd.org Subject: Re: fix for per-mount i/o counting in ffs Message-ID: <20160517220055.GF89104@kib.kiev.ua> References: <20160517072104.I2137@besplex.bde.org> <20160517084241.GY89104@kib.kiev.ua> <20160518061040.D5948@besplex.bde.org> <20160518070252.F6121@besplex.bde.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160518070252.F6121@besplex.bde.org> User-Agent: Mutt/1.6.1 (2016-04-27) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no autolearn_force=no version=3.4.1 X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on tom.home X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 May 2016 22:01:06 -0000 On Wed, May 18, 2016 at 07:30:25AM +1000, Bruce Evans wrote: > Further cleanups: > - the null pointer check is bogus since we already dereferenced > devvp->v_rdev. We also assigned devvp->v_rdev to the variable > dev but spelled out devvp->v_rdev in a couple of other places. > - the VCHR check is bogus since we only work for VCHR and have > already checked for VCHR in vn_isdisk(). No, these are not bogus. The checks are incorrect because they are racy, but they are needed with the proper locking. I intended to look at this tomorrow, since the fixes are not related to the current changes, but you forced me. VCHR check ensures that the devvp vnode is not reclaimed. 
I do not want to remove the check and rely on the caller of ffs_mountfs() to always do the right thing for it without unlocking devvp; this is too subtle. We are safe from devvp being reclaimed when io is in progress, since our reference prevents the cdev memory from being freed, which ensures that v_rdev is valid if non-NULL. Unmount is not supposed to finish until all io is finished (but we had bugs there). > > Similarly in ffs_umount() except there is no dev variable there. There is ump->um_dev. diff --git a/sys/ufs/ffs/ffs_vfsops.c b/sys/ufs/ffs/ffs_vfsops.c index 712fc21..da61c15 100644 --- a/sys/ufs/ffs/ffs_vfsops.c +++ b/sys/ufs/ffs/ffs_vfsops.c @@ -771,17 +771,18 @@ ffs_mountfs(devvp, mp, td) error = g_vfs_open(devvp, &cp, "ffs", ronly ? 0 : 1); g_topology_unlock(); PICKUP_GIANT(); - VOP_UNLOCK(devvp, 0); - if (error) + if (error) { + VOP_UNLOCK(devvp, 0); goto out; - if (devvp->v_rdev->si_iosize_max != 0) + } + if (dev->si_iosize_max != 0) mp->mnt_iosize_max = devvp->v_rdev->si_iosize_max; if (mp->mnt_iosize_max > MAXPHYS) mp->mnt_iosize_max = MAXPHYS; - devvp->v_bufobj.bo_ops = &ffs_ops; if (devvp->v_type == VCHR) - devvp->v_rdev->si_mountpt = mp; + dev->si_mountpt = mp; + VOP_UNLOCK(devvp, 0); fs = NULL; sblockloc = 0; @@ -1083,8 +1084,10 @@ ffs_mountfs(devvp, mp, td) out: if (bp) brelse(bp); + VOP_LOCK(devvp, LK_EXCLUSIVE | LK_RETRY); if (devvp->v_type == VCHR && devvp->v_rdev != NULL) devvp->v_rdev->si_mountpt = NULL; + VOP_UNLOCK(devvp, 0); if (cp != NULL) { DROP_GIANT(); g_topology_lock(); @@ -1287,9 +1290,11 @@ ffs_unmount(mp, mntflags) g_vfs_close(ump->um_cp); g_topology_unlock(); PICKUP_GIANT(); - if (ump->um_devvp->v_type == VCHR && ump->um_devvp->v_rdev != NULL) - ump->um_devvp->v_rdev->si_mountpt = NULL; - vrele(ump->um_devvp); + VOP_LOCK(ump->um_devvp, LK_EXCLUSIVE | LK_RETRY); + if (ump->um_devvp->v_type == VCHR && + ump->um_devvp->v_rdev == ump->um_dev) + ump->um_dev->si_mountpt = NULL; + vput(ump->um_devvp); dev_rel(ump->um_dev); mtx_destroy(UFS_MTX(ump)); if (mp->mnt_gjprovider != NULL) { From owner-freebsd-fs@freebsd.org Tue May 17 22:39:20 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id C0AC2B409DB for ; Tue, 17 May 2016 22:39:20 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mailman.ysv.freebsd.org (unknown [127.0.1.3]) by mx1.freebsd.org (Postfix) with ESMTP id ABFAD100E for ; Tue, 17 May 2016 22:39:20 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: by mailman.ysv.freebsd.org (Postfix) id A6FABB409D8; Tue, 17 May 2016 22:39:20 +0000 (UTC) Delivered-To: fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id A6AB3B409D7 for ; Tue, 17 May 2016 22:39:20 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail107.syd.optusnet.com.au (mail107.syd.optusnet.com.au [211.29.132.53]) by mx1.freebsd.org (Postfix) with ESMTP id 3FFCE1005 for ; Tue, 17 May 2016 22:39:19 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from c122-106-149-109.carlnfd1.nsw.optusnet.com.au (c122-106-149-109.carlnfd1.nsw.optusnet.com.au [122.106.149.109]) by mail107.syd.optusnet.com.au (Postfix) with ESMTPS id 76183D400F2; Wed, 18 May 2016 08:39:08 +1000 (AEST) Date: Wed, 18 May 2016 08:39:08 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Konstantin Belousov cc: fs@freebsd.org Subject:
Re: quick fix for slow directory shrinking in ffs In-Reply-To: <20160517212227.GE89104@kib.kiev.ua> Message-ID: <20160518081302.X6396@besplex.bde.org> References: <20160517072705.F2157@besplex.bde.org> <20160517082050.GX89104@kib.kiev.ua> <20160517192933.U4573@besplex.bde.org> <20160517111715.GC89104@kib.kiev.ua> <20160518035413.L4357@besplex.bde.org> <20160518052656.R5764@besplex.bde.org> <20160517212227.GE89104@kib.kiev.ua> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.1 cv=TuMb/2jh c=1 sm=1 tr=0 a=R/f3m204ZbWUO/0rwPSMPw==:117 a=L9H7d07YOLsA:10 a=9cW_t1CCXrUA:10 a=s5jvgZ67dGcA:10 a=kj9zAlcOel0A:10 a=LHc1-K_XCWJFMkeID04A:9 a=AnwALAnBAE9uY5ff:21 a=85e_Wl8BSmVFEIn7:21 a=CjuIK1q_8ugA:10 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 May 2016 22:39:20 -0000 On Wed, 18 May 2016, Konstantin Belousov wrote: > On Wed, May 18, 2016 at 05:40:07AM +1000, Bruce Evans wrote: >> Also, ftruncate() seems to be broken. POSIX doesn't seem to require it >> to honor O_SYNC, but POLA requires this. But there is no VOP_TRUNCATE(); >> truncation is done using VOP_SETATTR() and there is no way to pass down >> the O_SYNC flag to it; in practice, ffs just does UFS_TRUNCATE() without >> IO_SYNC. >> >> This makes a difference mainly for async mounts with my fixes to honor >> IO_SYNC in ffs_update(). With async mounts, consistency of the file >> system is not guaranteed but O_SYNC for a file should at least cause >> all of the file data and most of its metdata to be written. Not syncing >> for ftruncate() unnecessarily loses metadata writes. With !async mounts, >> consistency of the file system is partly guaranteed and lost metadata >> writes for ftruncate() shouldn't affect this -- they should just lose >> the ftruncate() atomically. >> >> vfs could do an fsync() after VOP_SETATTR() for the O_SYNC case. This >> reduces the race window. > > vattr already has the va_vaflags field. It is trivial to add flag there > requesting O_SYNC behaviour. Of course, other updates could also > honour VA_SYNC, but this is for later. Like this: > > diff --git a/sys/kern/vfs_vnops.c b/sys/kern/vfs_vnops.c > index 0a3a88a..1e42a3d 100644 > --- a/sys/kern/vfs_vnops.c > +++ b/sys/kern/vfs_vnops.c > @@ -1314,6 +1314,8 @@ vn_truncate(struct file *fp, off_t length, struct ucred *active_cred, > if (error == 0) { > VATTR_NULL(&vattr); > vattr.va_size = length; > + if ((fp->f_flag & O_FSYNC) != 0) > + vattr.va_vaflags |= VA_SYNC; > error = VOP_SETATTR(vp, &vattr, fp->f_cred); > } > out: > diff --git a/sys/sys/vnode.h b/sys/sys/vnode.h > index e82f6ee..41ec7f7 100644 > --- a/sys/sys/vnode.h > +++ b/sys/sys/vnode.h > @@ -286,6 +286,7 @@ struct vattr { > */ > #define VA_UTIMES_NULL 0x01 /* utimes argument was NULL */ > #define VA_EXCLUSIVE 0x02 /* exclusive create request */ > +#define VA_SYNC 0x04 /* O_SYNC truncation */ > > /* > * Flags for ioflag. (high 16 bits used to ask for read-ahead and > diff --git a/sys/ufs/ufs/ufs_vnops.c b/sys/ufs/ufs/ufs_vnops.c > index c0729f8..83df347 100644 > --- a/sys/ufs/ufs/ufs_vnops.c > +++ b/sys/ufs/ufs/ufs_vnops.c > @@ -625,7 +625,8 @@ ufs_setattr(ap) > */ > return (0); > } > - if ((error = UFS_TRUNCATE(vp, vap->va_size, IO_NORMAL, > + if ((error = UFS_TRUNCATE(vp, vap->va_size, IO_NORMAL | > + ((vap->va_vaflags & VA_SYNC) != 0 ? 
IO_SYNC : 0),
> 	    cred)) != 0)
> 		return (error);
> 	}

Looks good.

O_SYNC is actually spelled O_FSYNC in FreeBSD. You spelled it correctly in the above, but about two places in the kernel use the POSIX spelling. It is confusing enough to also have the spellings FFSYNC and IO_SYNC for this flag in different layers. FFSYNC is for fcntl and must equal O_FSYNC since the layers are not clearly separated. IO_SYNC is for a clearly separated layer, and O_FSYNC is supposed to be translated to it; but since it has the same value as O_FSYNC, an untranslated O_FSYNC might work accidentally.

This should probably also be done for truncations with O_TRUNC at open time. There are a couple of these in vfs_syscalls.c. O_TRUNC is used much more than ftruncate() so the extra overhead from this would be more noticeable. I think the implementation is not very good. If open() with O_TRUNC or truncate with O_FSYNC or fsync() fails, then the file contents might be garbage. So it would be better to do large truncations mostly async and only sync at the end. *fs_truncate() could operate like that, but I think it takes the IO_SYNC flag as a directive to do the whole operation synchronously. A non-sync truncation followed by fsync() is likely to work better for ffs and just work for all fs's.

Bruce

From owner-freebsd-fs@freebsd.org Tue May 17 23:11:57 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 150EDB40E1A for ; Tue, 17 May 2016 23:11:57 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from mailman.ysv.freebsd.org (unknown [127.0.1.3]) by mx1.freebsd.org (Postfix) with ESMTP id F209A1D52 for ; Tue, 17 May 2016 23:11:56 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: by mailman.ysv.freebsd.org (Postfix) id F1559B40E19; Tue, 17 May 2016 23:11:56 +0000 (UTC) Delivered-To: fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id F0F75B40E18 for ; Tue, 17 May 2016 23:11:56 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 71FD11D51 for ; Tue, 17 May 2016 23:11:56 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from tom.home (kib@localhost [127.0.0.1]) by kib.kiev.ua (8.15.2/8.15.2) with ESMTPS id u4HNBpVC033011 (version=TLSv1 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO); Wed, 18 May 2016 02:11:51 +0300 (EEST) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.10.3 kib.kiev.ua u4HNBpVC033011 Received: (from kostik@localhost) by tom.home (8.15.2/8.15.2/Submit) id u4HNBoea033010; Wed, 18 May 2016 02:11:50 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Wed, 18 May 2016 02:11:50 +0300 From: Konstantin Belousov To: Bruce Evans Cc: fs@freebsd.org Subject: Re: quick fix for slow directory shrinking in ffs Message-ID: <20160517231150.GG89104@kib.kiev.ua> References: <20160517072705.F2157@besplex.bde.org> <20160517082050.GX89104@kib.kiev.ua> <20160517192933.U4573@besplex.bde.org> <20160517111715.GC89104@kib.kiev.ua> <20160518035413.L4357@besplex.bde.org> <20160518052656.R5764@besplex.bde.org> <20160517212227.GE89104@kib.kiev.ua> 
<20160518081302.X6396@besplex.bde.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160518081302.X6396@besplex.bde.org> User-Agent: Mutt/1.6.1 (2016-04-27) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no autolearn_force=no version=3.4.1 X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on tom.home X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Tue, 17 May 2016 23:11:57 -0000

On Wed, May 18, 2016 at 08:39:08AM +1000, Bruce Evans wrote:
> Looks good.
...
> This should probably also be done for truncations with O_TRUNC at open
> time. There are a couple of these in vfs_syscalls.c. O_TRUNC is used
> much more than ftruncate() so the extra overhead from this would be
> more noticeable. I think the implementation is not very good. If
> open() with O_TRUNC or truncate with O_FSYNC or fsync() fails, then
> the file contents might be garbage. So it would be better to do
> large truncations mostly async and only sync at the end. *fs_truncate()
> could operate like that, but I think it takes the IO_SYNC flag as a
> directive to do the whole operation synchronously. A non-sync truncation
> followed by fsync() is likely to work better for ffs and just work for
> all fs's.

I see only two places which call fo_truncate() in vfs_syscalls.c, after the O_TRUNC test. Both cases are after some kind of open, and the mechanism from my patch does synchronous truncation automatically for the callers.

Of course, truncation errors from O_TRUNC in open are fatal, and the precious file (otherwise O_SYNC would not be specified at all) is in an undefined and damaged state if that happens. From this point of view, O_TRUNC was a bad idea.

I looked at the POSIX text, and while ftruncate(2) is allowed to return e.g. EIO, for open(2) EIO is not listed in case of truncation problems. I am not sure if the generic rules of POSIX allow saying that the condition is undefined. The implementation cannot handle that case without loss. 
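[Editor's illustration: a minimal userspace sketch of the "non-sync truncation followed by fsync()" ordering discussed in this thread. The file path is a made-up placeholder and the sketch is untested; O_FSYNC is FreeBSD's spelling of POSIX O_SYNC, so even if ftruncate() does not honor the flag, the explicit fsync() reports any failure instead of silently leaving damaged contents.]

#include <err.h>
#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
	int fd;

	/* Placeholder path; O_FSYNC asks for synchronous writes. */
	fd = open("/tmp/precious.dat", O_RDWR | O_CREAT | O_FSYNC, 0644);
	if (fd == -1)
		err(1, "open");

	/*
	 * The truncation itself may run asynchronously in the kernel;
	 * the explicit fsync() afterwards surfaces any error here.
	 */
	if (ftruncate(fd, 0) == -1)
		err(1, "ftruncate");
	if (fsync(fd) == -1)
		err(1, "fsync");
	close(fd);
	return (0);
}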
From owner-freebsd-fs@freebsd.org Wed May 18 00:00:42 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id C5721B3FD81 for ; Wed, 18 May 2016 00:00:42 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mailman.ysv.freebsd.org (unknown [127.0.1.3]) by mx1.freebsd.org (Postfix) with ESMTP id AF019149C for ; Wed, 18 May 2016 00:00:42 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: by mailman.ysv.freebsd.org (Postfix) id A98A8B3FD7B; Wed, 18 May 2016 00:00:42 +0000 (UTC) Delivered-To: fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id A91AEB3FD78 for ; Wed, 18 May 2016 00:00:42 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail104.syd.optusnet.com.au (mail104.syd.optusnet.com.au [211.29.132.246]) by mx1.freebsd.org (Postfix) with ESMTP id 991681475 for ; Wed, 18 May 2016 00:00:40 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from c122-106-149-109.carlnfd1.nsw.optusnet.com.au (c122-106-149-109.carlnfd1.nsw.optusnet.com.au [122.106.149.109]) by mail104.syd.optusnet.com.au (Postfix) with ESMTPS id C347F428F06; Wed, 18 May 2016 10:00:31 +1000 (AEST) Date: Wed, 18 May 2016 10:00:09 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Konstantin Belousov cc: fs@freebsd.org Subject: Re: fix for per-mount i/o counting in ffs In-Reply-To: <20160517220055.GF89104@kib.kiev.ua> Message-ID: <20160518084931.T6534@besplex.bde.org> References: <20160517072104.I2137@besplex.bde.org> <20160517084241.GY89104@kib.kiev.ua> <20160518061040.D5948@besplex.bde.org> <20160518070252.F6121@besplex.bde.org> <20160517220055.GF89104@kib.kiev.ua> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.1 cv=c+ZWOkJl c=1 sm=1 tr=0 a=R/f3m204ZbWUO/0rwPSMPw==:117 a=L9H7d07YOLsA:10 a=9cW_t1CCXrUA:10 a=s5jvgZ67dGcA:10 a=kj9zAlcOel0A:10 a=_FYEsZlHC8cxAzq_7eoA:9 a=CjuIK1q_8ugA:10 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 18 May 2016 00:00:42 -0000 On Wed, 18 May 2016, Konstantin Belousov wrote: > On Wed, May 18, 2016 at 07:30:25AM +1000, Bruce Evans wrote: >> Further cleanups: >> - the null pointer check is bogus since we already dereferenced >> devvp->v_rdev. We also assigned devvp->v_rdev to the variable >> dev but spelled out devvp->v_rdev in a couple of other places. >> - the VCHR check is bogus since we only work for VCHR and have >> already checked for VCHR in vn_isdisk(). > No, these are not bogus. The checks are incorrect because they are > racy, but they are needed with the proper locking. I intended to look > at this tomorrow, since the fixes are not related to the current changes, > but you forced me. You are too efficient :-). > VCHR check ensures that the devvp vnode is not reclaimed. I do not want > to remove the check and rely on the caller of ffs_mountfs() to always do > the right thing for it without unlocking devvp, this is too subtle. Surely the caller must lock devvp? Otherwise none of the uses of devvp can be trusted, and there are several others. 
> We are safe from devvp being reclaimed when io is in progress, since
> our reference prevents the cdev memory from being freed, which ensures
> that v_rdev is valid if non-NULL. Unmount is not supposed to finish
> until all io is finished (but we had bugs there).
>> Similarly in ffs_umount() except there is no dev variable there.
> There is ump->um_dev.

There is also ump->um_devvp, but this seems to be unusable since it might go away. So using devvp->v_rdev instead of the dev variable is not just a style bug.

> diff --git a/sys/ufs/ffs/ffs_vfsops.c b/sys/ufs/ffs/ffs_vfsops.c
> index 712fc21..da61c15 100644
> --- a/sys/ufs/ffs/ffs_vfsops.c
> +++ b/sys/ufs/ffs/ffs_vfsops.c
> @@ -771,17 +771,18 @@ ffs_mountfs(devvp, mp, td)
> 	error = g_vfs_open(devvp, &cp, "ffs", ronly ? 0 : 1);
> 	g_topology_unlock();
> 	PICKUP_GIANT();
> -	VOP_UNLOCK(devvp, 0);
> -	if (error)
> +	if (error) {
> +		VOP_UNLOCK(devvp, 0);
> 		goto out;
> -	if (devvp->v_rdev->si_iosize_max != 0)
> +	}
> +	if (dev->si_iosize_max != 0)
> 		mp->mnt_iosize_max = devvp->v_rdev->si_iosize_max;
> 	if (mp->mnt_iosize_max > MAXPHYS)
> 		mp->mnt_iosize_max = MAXPHYS;
> -
> 	devvp->v_bufobj.bo_ops = &ffs_ops;
> 	if (devvp->v_type == VCHR)

devvp must still be VCHR since this is now under the vnode lock, and we depend on dev remaining a character device for the disk described by devvp at the time of the vn_isdisk() check.

> -		devvp->v_rdev->si_mountpt = mp;
> +		dev->si_mountpt = mp;
> +	VOP_UNLOCK(devvp, 0);

The unlocking could be a little earlier since dev is still for a disk even if devvp went away and you changed this to not use devvp->v_rdev.

>
> 	fs = NULL;
> 	sblockloc = 0;

Unlocking and then using devvp sure looks like a race.

You only needed to move the unlocking to fix devvp->v_bufobj. How does that work? The write is now locked, but if devvp goes away, then don't we lose its bufobj?

> @@ -1083,8 +1084,10 @@ ffs_mountfs(devvp, mp, td)
> out:
> 	if (bp)
> 		brelse(bp);
> +	VOP_LOCK(devvp, LK_EXCLUSIVE | LK_RETRY);
> 	if (devvp->v_type == VCHR && devvp->v_rdev != NULL)
> 		devvp->v_rdev->si_mountpt = NULL;
> +	VOP_UNLOCK(devvp, 0);
> 	if (cp != NULL) {
> 		DROP_GIANT();
> 		g_topology_lock();

Why not just dev->si_mountpt = NULL unconditionally? We must do this even if devvp went away, and we can easily do it using dev alone, as above.

> @@ -1287,9 +1290,11 @@ ffs_unmount(mp, mntflags)
> 	g_vfs_close(ump->um_cp);
> 	g_topology_unlock();
> 	PICKUP_GIANT();
> -	if (ump->um_devvp->v_type == VCHR && ump->um_devvp->v_rdev != NULL)
> -		ump->um_devvp->v_rdev->si_mountpt = NULL;
> -	vrele(ump->um_devvp);
> +	VOP_LOCK(ump->um_devvp, LK_EXCLUSIVE | LK_RETRY);
> +	if (ump->um_devvp->v_type == VCHR &&
> +	    ump->um_devvp->v_rdev == ump->um_dev)
> +		ump->um_dev->si_mountpt = NULL;
> +	vput(ump->um_devvp);

As above. We don't care if um_devvp went away, at least for clearing si_mountpt, and must use ump->um_dev to clear si_mountpt.

> 	dev_rel(ump->um_dev);

Presumably ump->um_dev was referenced throughout until here, and this is the only thing keeping the device from going away too.

> 	mtx_destroy(UFS_MTX(ump));
> 	if (mp->mnt_gjprovider != NULL) {
>

How does any use of ump->um_devvp work?

I tried revoke(2) on the devvp of a mounted file system. This worked to give v_type = VBAD and v_rdev = NULL, but didn't crash. ffs_unmount() checked for the bad vnode, unlike most places, and failed to clear si_mountpt.

Normal use doesn't have revokes, but if the vnode is reclaimed instead of just becoming bad, then worse things probably happen. 
I think vnode cache resizing gives very unstable storage, so the pointer becomes very invalid. But even revoke followed by setting kern.maxvnodes to 1 didn't crash (15 vnodes remained). So devvp must be referenced throughout. It seems to have reference count 2, since umounting reduced kern.numvnodes from 15 to 13. (It is surprising how much works with kern.maxvnodes=1. I was able to run revoke, sysctl and umount.)

It is still a mystery that the VBAD vnode doesn't crash soon.

Bruce

From owner-freebsd-fs@freebsd.org Wed May 18 01:54:16 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id DC984B40DD7 for ; Wed, 18 May 2016 01:54:16 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mailman.ysv.freebsd.org (unknown [127.0.1.3]) by mx1.freebsd.org (Postfix) with ESMTP id C979B1E07 for ; Wed, 18 May 2016 01:54:16 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: by mailman.ysv.freebsd.org (Postfix) id C8CB4B40DD6; Wed, 18 May 2016 01:54:16 +0000 (UTC) Delivered-To: fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id C6308B40DD5 for ; Wed, 18 May 2016 01:54:16 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail108.syd.optusnet.com.au (mail108.syd.optusnet.com.au [211.29.132.59]) by mx1.freebsd.org (Postfix) with ESMTP id 871B01E06 for ; Wed, 18 May 2016 01:54:15 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from c122-106-149-109.carlnfd1.nsw.optusnet.com.au (c122-106-149-109.carlnfd1.nsw.optusnet.com.au [122.106.149.109]) by mail108.syd.optusnet.com.au (Postfix) with ESMTPS id 2CF791A6B44; Wed, 18 May 2016 11:54:11 +1000 (AEST) Date: Wed, 18 May 2016 11:54:02 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Bruce Evans cc: Konstantin Belousov , fs@freebsd.org Subject: Re: fix for per-mount i/o counting in ffs In-Reply-To: <20160518084931.T6534@besplex.bde.org> Message-ID: <20160518110928.Q6900@besplex.bde.org> References: <20160517072104.I2137@besplex.bde.org> <20160517084241.GY89104@kib.kiev.ua> <20160518061040.D5948@besplex.bde.org> <20160518070252.F6121@besplex.bde.org> <20160517220055.GF89104@kib.kiev.ua> <20160518084931.T6534@besplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.1 cv=c+ZWOkJl c=1 sm=1 tr=0 a=R/f3m204ZbWUO/0rwPSMPw==:117 a=L9H7d07YOLsA:10 a=9cW_t1CCXrUA:10 a=s5jvgZ67dGcA:10 a=kj9zAlcOel0A:10 a=WiiwgCuxqNMNQUgj__EA:9 a=CjuIK1q_8ugA:10 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 18 May 2016 01:54:17 -0000

On Wed, 18 May 2016, Bruce Evans wrote:
> ...
> How does any use of ump->um_devvp work?
>
> I tried revoke(2) on the devvp of a mounted file system. This worked
> to give v_type = VBAD and v_rdev = NULL, but didn't crash. ffs_unmount()
> checked for the bad vnode, unlike most places, and failed to clear
> si_mountpt.
>
> Normal use doesn't have revokes, but if the vnode is reclaimed instead
> of just becoming bad, then worse things probably happen. I think vnode
> ...
I still haven't generated a crash, but revoke certainly does one bad thing: it breaks detection of busy devices so that the same device can be mounted more than once. GEOM was supposed to allow multiple mounts for ro mounts, but this gave garbage pointers and is turned off. To turn it back on, use:

% mount -o ro /dev/ad4s4a /i	# my normal mount
% mount -o ro /dev/ad4s4a /i	# fails with EBUSY
% revoke /dev/ad4s4a
% ls /i				# seems to work
% mount -o ro /dev/ad4s4a /i	# doesn't fail; clobbers ptrs
% ls /i				# seems to work
% umount /i			# seems to work, but clobbers
% ls /i				# top of stack still there
% umount /i			# seems to work

Crashes can probably be arranged by writing to the device after it is revoked. The device is supposed to be exclusive access or at least ro, but revoke breaks that. Or just put two independent valid file systems on the same device in advance or by writing, so as to clobber the pointers better.

The exclusive access can also be broken using separate devfs instances:

% mount -o ro /dev/ad4s4a /i
% mkdir /tmp/dev
% mount -t devfs devfs /tmp/dev		# normal sort of use for jails?
% mount -o rw /tmp/dev/ad4s4a /i	# doesn't fail; can even be rw

Perhaps this doesn't clobber pointers near bufobj as badly as the turned-off code, but it certainly clobbers si_mountpt. Each new mount sets si_mountpt in the shared cdev struct. The first unmount sets this to NULL so I think it never points to garbage. It just points to the wrong mount struct or is turned off. The case of multiple devfs instances has a chance of working since devvp is separate, so assignments to devvp->v_bufobj don't clobber previous mounts.

I now remember that this prevented me from finding a fix for the i/o counting. Multiple mounts were supposed to work, but obviously a single pointer in the cdev cannot work for multiple mounts. I think it was removed (breaking the i/o counting) because it was too hard to fix it to work even for a single mount (since allowing multiple mounts gives pointer clobbering problems).

Bruce

From owner-freebsd-fs@freebsd.org Wed May 18 04:24:09 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 71B54B402DE for ; Wed, 18 May 2016 04:24:09 +0000 (UTC) (envelope-from lexa@lexa.ru) Received: from mx3.lexa.ru (ns503534.ip-198-27-68.net [198.27.68.102]) by mx1.freebsd.org (Postfix) with ESMTP id 53A8516EB for ; Wed, 18 May 2016 04:24:08 +0000 (UTC) (envelope-from lexa@lexa.ru) Received: by mx3.lexa.ru (Postfix, from userid 66) id 9733C224A5D; Wed, 18 May 2016 00:24:07 -0400 (EDT) Received: from [193.124.130.166] (unknown [193.124.130.166]) by home-gw.lexa.ru (Postfix) with ESMTP id A176D1801 for ; Wed, 18 May 2016 07:21:46 +0300 (MSK) Subject: Re: ZFS performance bottlenecks: CPU or RAM or anything else? 
To: "freebsd-fs@freebsd.org" References: <8441f4c0-f8d1-f540-b928-7ae60998ba8e@lexa.ru> <16e474da-6b20-2e51-9981-3c262eaff350@lexa.ru> <1e012e43-a49b-6923-3f0a-ee77a5c8fa70@lexa.ru> <86shxgsdzh.fsf@WorkBox.Home> From: Alex Tutubalin Message-ID: Date: Wed, 18 May 2016 07:21:46 +0300 User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:45.0) Gecko/20100101 Thunderbird/45.1.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 18 May 2016 04:24:09 -0000 On 5/18/2016 12:11 AM, Steven Hartland wrote: > Raidz is limited essential limited to a single drive performance > per dev for read and write while mirror is single drive performance > for write its number of drives for read. Don't forget mirror is not > limited to two it can be three, four or more; so if you need more read > throughput you can add drives to the mirror. Do I understand it correctly: - single write of one large file (or singe local write to zvol shared via iSCSI) will be local: single or only several metaslabs - for RAIDZ each disk will get only part of throughput - for mirror, each disk included in write will receive full data size (and for single local write only limited number of disks to be included in write) If so, raidz will have huge write performance benefit in my case: single write of one large file. As for read speed, I hope to deal with it with large enough L2ARC on SSDs. > > To increase raidz performance you need to add more vdevs. While this > doesn't have to be double i.e. the same vdev config as the first it > generally a good idea. Again, multiple vdevs will help for multiple parallel writes, but not for single one? 
Alex Tutubalin From owner-freebsd-fs@freebsd.org Wed May 18 06:10:24 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 93165B40B0D for ; Wed, 18 May 2016 06:10:24 +0000 (UTC) (envelope-from ben.rubson@gmail.com) Received: from mail-wm0-x22c.google.com (mail-wm0-x22c.google.com [IPv6:2a00:1450:400c:c09::22c]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id 2A5411D8C for ; Wed, 18 May 2016 06:10:24 +0000 (UTC) (envelope-from ben.rubson@gmail.com) Received: by mail-wm0-x22c.google.com with SMTP id a17so62290187wme.0 for ; Tue, 17 May 2016 23:10:24 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:subject:from:in-reply-to:date :content-transfer-encoding:message-id:references:to; bh=Qv5tOdb3utavt048KxW1XyVrzocu+YCEfUcA8NHz4g4=; b=aRx2z1VZ+tv8ftgMKdOwQnv1/CO2OMxGqmS8IAnBZcMPIyMp+iGzbVYlGAhm2DbBlS xFm5GuwiFzSkKoEMk1uflK8/8VuRF4n89zpoP7DBJ1QPVJBLKhY8HwfJ8kVTP8OvFGsq 5ItbzQes0HmYDOclafenJJngCEXuaN4n2ZCYAM49/dc8zgpOgI9lQWe4Xlf39T9tZPKO F0nDSt+eKiUyNwYxnfGhu+pHT/dz6hlb+2V6MM7k0ffg7TkjuBbeakl1yHurz79g1ECh 5xG6M7h1bxXNRAE7/xHR2Ny51R69RFGN0Pg7JToYnrHkhFLTtOfJSIIiVZOzMrCMrriX uJQQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:subject:from:in-reply-to:date :content-transfer-encoding:message-id:references:to; bh=Qv5tOdb3utavt048KxW1XyVrzocu+YCEfUcA8NHz4g4=; b=aEUB227kzOPHMChVgzF5D3pJq3sqvP1b2L1trzImc1AbUIhP5sliVV3jPS/favImL9 no2LhINsrARNfB8rEuDFLElu5MY63OOf9tN+DqlNf8+1tmPD2r3E/MojyCk5sVcsmovD iMA778enxgIECvV0TYZ07O9YBp2dCkuXPUvHwjf7knTMRtFSKRhQ2GvUS5Km6AtGndtT G3PD8/tIf6cqqCh0g70wIy7B+0p6XWVsaQjrO23VJM2h1SZ6BbxmS/vEIYZNqL2qOS7R zCHmy89sylx44+sSsA6a3rm0cj3glkK8pourn1U5Z2kVViReJ98jG0btLRif4rHBgidg S69Q== X-Gm-Message-State: AOPr4FWknu2MfbEsn3Wyu+SIpy2vcSZUoyPLsQak9ba/JPQF+lCUrMNEb+IMiRA0nmGY6Q== X-Received: by 10.194.203.227 with SMTP id kt3mr5054812wjc.73.1463551822537; Tue, 17 May 2016 23:10:22 -0700 (PDT) Received: from [192.168.0.1] (cag06-2-82-237-68-117.fbx.proxad.net. [82.237.68.117]) by smtp.gmail.com with ESMTPSA id b22sm7233476wmb.9.2016.05.17.23.10.21 for (version=TLS1 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Tue, 17 May 2016 23:10:21 -0700 (PDT) Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\)) Subject: Re: State of native encryption in ZFS From: Ben RUBSON In-Reply-To: <0CE6E456-CC25-4AED-A73E-F5BBE659F795@mail.turbofuzz.com> Date: Wed, 18 May 2016 08:10:20 +0200 Content-Transfer-Encoding: 7bit Message-Id: References: <5736E7B4.1000409@gmail.com> <0CE6E456-CC25-4AED-A73E-F5BBE659F795@mail.turbofuzz.com> To: freebsd-fs@freebsd.org X-Mailer: Apple Mail (2.3124) X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 18 May 2016 06:10:24 -0000 >> I wish to know somethign new about native encryption in ZFS for FreeBSD. >> Any works in this direction are conducted? > > Short and simple answer: No. However, look at this : https://github.com/zfsonlinux/zfs/pull/4329 Certainly something interesting ! 
Ben From owner-freebsd-fs@freebsd.org Wed May 18 06:43:36 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 2AFC2B4046C for ; Wed, 18 May 2016 06:43:36 +0000 (UTC) (envelope-from peter@rulingia.com) Received: from vps.rulingia.com (vps.rulingia.com [103.243.244.15]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "rulingia.com", Issuer "Let's Encrypt Authority X3" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id B1D6211EB for ; Wed, 18 May 2016 06:43:35 +0000 (UTC) (envelope-from peter@rulingia.com) Received: from server.rulingia.com (ppp59-167-167-3.static.internode.on.net [59.167.167.3]) by vps.rulingia.com (8.15.2/8.15.2) with ESMTPS id u4I6hJ2X001483 (version=TLSv1.2 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=OK); Wed, 18 May 2016 16:43:25 +1000 (AEST) (envelope-from peter@rulingia.com) X-Bogosity: Ham, spamicity=0.000000 Received: from server.rulingia.com (localhost.rulingia.com [127.0.0.1]) by server.rulingia.com (8.15.2/8.15.2) with ESMTPS id u4I6hDvZ069784 (version=TLSv1.2 cipher=DHE-RSA-AES256-GCM-SHA384 bits=256 verify=NO); Wed, 18 May 2016 16:43:13 +1000 (AEST) (envelope-from peter@server.rulingia.com) Received: (from peter@localhost) by server.rulingia.com (8.15.2/8.15.2/Submit) id u4I6hBKq069783; Wed, 18 May 2016 16:43:11 +1000 (AEST) (envelope-from peter) Date: Wed, 18 May 2016 16:43:11 +1000 From: Peter Jeremy To: "Matthew D. Fuller" Cc: Freddie Cash , "freebsd-fs@freebsd.org" , Steven Hartland Subject: Re: ZFS performance bottlenecks: CPU or RAM or anything else? Message-ID: <20160518064311.GA22800@server.rulingia.com> References: <8441f4c0-f8d1-f540-b928-7ae60998ba8e@lexa.ru> <16e474da-6b20-2e51-9981-3c262eaff350@lexa.ru> <1e012e43-a49b-6923-3f0a-ee77a5c8fa70@lexa.ru> <86shxgsdzh.fsf@WorkBox.Home> <20160517213549.GK24656@over-yonder.net> MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha512; protocol="application/pgp-signature"; boundary="AqsLC8rIMeq19msA" Content-Disposition: inline In-Reply-To: <20160517213549.GK24656@over-yonder.net> X-PGP-Key: http://www.rulingia.com/keys/peter.pgp User-Agent: Mutt/1.6.1 (2016-04-27) X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 18 May 2016 06:43:36 -0000 --AqsLC8rIMeq19msA Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On 2016-May-17 16:35:49 -0500, "Matthew D. Fuller" wrote: >More specifically, as I read it, different performance in a very >specific metric; single-thread linear bulk writes. That doesn't seem >like it would benefit heavily from a lot of cores available, or from >RAM bandwidth or size above a pretty low threshold. Actually, whilst I presume the OP has compression disabled, ZFS can very effectively use multiple cores to compress data - even if it's only a single linear writer. >Of course, it's not just changing the CPU and RAM; it's also the >motherboard, and possibly the HBA (at least the bus the HBA is on, if >it's a card being transplanted with the pool). And the Core 2 would >be back in the plain-old FSB era, so RAM access would be competing >with the disk IO on the bus. 
Without knowing much more about the configuration of each system, it's impossible to identify where the bottleneck might be.

-- 
Peter Jeremy

From owner-freebsd-fs@freebsd.org Wed May 18 07:27:52 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 5685FB403DA for ; Wed, 18 May 2016 07:27:52 +0000 (UTC) (envelope-from ben.rubson@gmail.com) Received: from mail-wm0-x233.google.com (mail-wm0-x233.google.com [IPv6:2a00:1450:400c:c09::233]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id F20FE1FD0 for ; Wed, 18 May 2016 07:27:51 +0000 (UTC) (envelope-from ben.rubson@gmail.com) Received: by mail-wm0-x233.google.com with SMTP id n129so170672230wmn.1 for ; Wed, 18 May 2016 00:27:51 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20120113; h=mime-version:subject:from:in-reply-to:date :content-transfer-encoding:message-id:references:to; bh=ZEmanaacv3nlhLHS8gNaBDDsFjRhWPwJB5yfIkpZz9w=; b=hUTSlDFFHWJtyr8v9jKgc5HTqT77NusTNptpGAk+Td1BciSWPbz1+JDsGhF+n7bv8x 2ouXewcseO/i7/XTYvx9Br0TC0Bk0RHfwIv6DqU4wreTp5hZoNFWK/QQqlgx7EiMiFkI HpAPDO5tiXn342wOWAefrwlmv/8ZawFs7VigE8/E886E2+U2PPrjyeYDorNGYNE0FIt4 yq2MEu2fdD/30285G9PHM3TpUW029cZQbo6CE5VhsvxidKRHMiRKsgurxYKxxWQdYNoV sNFliHcKbyGhH1fLTaciLckmwf5E4wBV73N1IZcijZXfPkQbZzrKfcSHr2X2fY0RG0R3 S8MQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:subject:from:in-reply-to:date :content-transfer-encoding:message-id:references:to; bh=ZEmanaacv3nlhLHS8gNaBDDsFjRhWPwJB5yfIkpZz9w=; b=JImszdUKiXyUVd87jUFlqiXgs/kJutvb9Yg7IDTHmlcKmoj0KKlb02fMEUHReGkJuU E4lq9Z0p56CeH9+tNvzSkZYFy5F/a58WYOu2tVJXowtcJkfvd3ENl5uPkvTNypfSTcoo q54j76oYdhUjDeU8drFCJ/REhJ6Tk+mXfF5I53jMo8PLISeQrdMhFz2OQIyaNFN9U+2w 7x/aBFWVvLua6FF5t3mjgX3OXa8uxaA3N/HcNpbCblPYf7Q6FcWhvhOpYrNgr4UOlqT8 jGUlf1OzG+L9djm4pAT+Jn9xZACyEP7hGqymJAlVntmQr9iQwLdOfVMjBHVuFO/k17Ev PzMg== X-Gm-Message-State: AOPr4FUujMPYDcHNWVBPKFP+ishS3t6YckI7csrrAJCdYRDYY5/WmVZU3zw0W8cNzG71Ow== X-Received: by 10.28.215.197 with SMTP id o188mr5967190wmg.14.1463556470508; Wed, 18 May 2016 00:27:50 -0700 (PDT) Received: from [192.168.0.1] (cag06-2-82-237-68-117.fbx.proxad.net. 
[82.237.68.117]) by smtp.gmail.com with ESMTPSA id a75sm7532098wme.18.2016.05.18.00.27.49 for (version=TLS1 cipher=ECDHE-RSA-AES128-SHA bits=128/128); Wed, 18 May 2016 00:27:49 -0700 (PDT) Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (Mac OS X Mail 9.3 \(3124\)) Subject: Re: Best practice for high availability ZFS pool From: Ben RUBSON In-Reply-To: Date: Wed, 18 May 2016 09:27:48 +0200 Content-Transfer-Encoding: quoted-printable Message-Id: <5F874CA9-A8D9-4A09-A4BD-95466AB7D165@gmail.com> References: <5E69742D-D2E0-437F-B4A9-A71508C370F9@FreeBSD.org> <40C35566-B7FB-4F59-BB41-D43BC0362C26@gmail.com> To: freebsd-fs@freebsd.org X-Mailer: Apple Mail (2.3124) X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 18 May 2016 07:27:52 -0000

> On 17 may 2016 at 19:06, Bob Friesenhahn wrote:
>
> On Tue, 17 May 2016, Ben RUBSON wrote:
>
>>> On 17 may 2016 at 15:24, Bob Friesenhahn wrote:
>>>
>>> There is at least one case of zfs send propagating a problem into the receiving pool. I don't know if it broke the pool. Corrupt data may be sent from one pool to another if it passes checksums.
>>
>> Do you have any link to this problem ? Would be interesting to know if it was possible to come-back to a previous snapshot / consistent pool.
>
> I don't have a link but I recall that it had something to do with the ability to send file 'holes' in the stream.

OK, just for reference : https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=207714

>> I think that making ZFS send/receive has a higher security level than mirroring to a second (or third) JBOD box.
>> With mirroring you will still have only one ZFS pool.
>
> This is a reasonable assumption.
>
>> However, if send/receive makes the receiving pool the exact 1:1 copy of the sending pool, then the thing which made the sending pool to corrupt could reach (and corrupt) the receiving pool... I don't know whether or not this could occur, and if ever it occurs, if we have the chance to revert to a previous snapshot, at least on the receiving side...
>
> Zfs receive does not result in a 1:1 copy. The underlying data organization can be completely different and compression or other options can be changed.

Yes, so if we assume ZFS send/receive is bug-free, having a second pool which receives the data of the first one (mirrored to different JBOD boxes) makes sense.

For the first pool, we could think about the following :
- server1 with its JBOD as an iSCSI target ;
- server2 with the exact same JBOD, iSCSI initiator, hosts a ZFS pool which mirrors each of server2's disks with one of server1's disks.

If ever server2 fails, server1 imports the pool and brings the service back up.
When server2 comes back, it acts as the new iSCSI target and gives its disks to server1, which reconstructs the mirror.
Disk redundancy, and hardware redundancy.

And regularly, this pool is sent/received to a different pool on server3, we never know... 
Sounds good (to me at least :)

Ben

From owner-freebsd-fs@freebsd.org Wed May 18 07:53:29 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id D03D2B40331 for ; Wed, 18 May 2016 07:53:29 +0000 (UTC) (envelope-from girgen@pingpong.net) Received: from mail.pingpong.net (mail.pingpong.net [79.136.116.202]) by mx1.freebsd.org (Postfix) with ESMTP id 9B87D1491; Wed, 18 May 2016 07:53:29 +0000 (UTC) (envelope-from girgen@pingpong.net) Received: from [10.22.157.15] (unknown [94.234.170.60]) (using TLSv1 with cipher ECDHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.pingpong.net (Postfix) with ESMTPSA id 5DC1C16878; Wed, 18 May 2016 09:53:27 +0200 (CEST) Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (1.0) Subject: Re: Best practice for high availability ZFS pool From: Palle Girgensohn X-Mailer: iPhone Mail (13E238) In-Reply-To: <5DA13472-F575-4D3D-80B7-1BE371237CE5@getsomewhere.net> Date: Wed, 18 May 2016 09:53:26 +0200 Cc: Palle Girgensohn , freebsd-fs@freebsd.org Content-Transfer-Encoding: quoted-printable Message-Id: <8E674522-17F0-46AC-B494-F0053D87D2B0@pingpong.net> References: <5E69742D-D2E0-437F-B4A9-A71508C370F9@FreeBSD.org> <5DA13472-F575-4D3D-80B7-1BE371237CE5@getsomewhere.net> To: Joe Love X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 18 May 2016 07:53:29 -0000

> 17 maj 2016 kl. 18:13 skrev Joe Love :
>
>> On May 16, 2016, at 5:08 AM, Palle Girgensohn wrote:
>>
>> Hi,
>>
>> We need to set up a ZFS pool with redundance. The main goal is high availability - uptime.
>>
>> I can see a few of paths to follow.
>>
>> 1. HAST + ZFS
>>
>> 2. Some sort of shared storage, two machines sharing a JBOD box.
>>
>> 3. ZFS replication (zfs snapshot + zfs send | ssh | zfs receive)
>>
>> 4. using something else than ZFS, even a different OS if required.
>>
>> My main concern with HAST+ZFS is performance. Google offer some insights here, I find mainly unsolved problems. Please share any success stories or other experiences.
>>
>> Shared storage still has a single point of failure, the JBOD box. Apart from that, is there even any support for the kind of storage PCI cards that support dual head for a storage box? I cannot find any.
>>
>> We are running with ZFS replication today, but it is just too slow for the amount of data.
>>
>> We prefer to keep ZFS as we already have a rather big (~30 TB) pool and also tools, scripts, backup all is using ZFS, but if there is no solution using ZFS, we're open to alternatives. Nexenta springs to mind, but I believe it is using shared storage for redundance, so it does have single points of failure?
>>
>> Any other suggestions? Please share your experience. :)
>>
>> Palle
>
> I don’t know if this falls into the realm of what you want, but BSDMag just released an issue with an article entitled “Adding ZFS to the FreeBSD dual-controller storage concept.”
> https://bsdmag.org/download/reusing_openbsd/
>
> My understanding in this setup is that the only single point of failure for this model is the backplanes that the drives would connect to. 
> Depending on your controller cards, this could be alleviated by simply using multiple drive shelves, and only using one drive/shelf as part of a vdev (then stripe or whatnot over your vdevs).
>
> It might not be what you’re after, as it’s basically two systems with their own controllers, with a shared set of drives. Some expansion from the virtual world to real physical systems will probably need additional variations.
> I think the TrueNAS system (with HA) is setup similar to this, only without the split between the drives being primarily handled by separate controllers, but someone with more in-depth knowledge would need to confirm/deny this.
>
> -Jo

Hi,

Do you know any specific controllers that work with dual head?

Thanks,
Palle

From owner-freebsd-fs@freebsd.org Wed May 18 08:02:01 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 5AE76B40896 for ; Wed, 18 May 2016 08:02:01 +0000 (UTC) (envelope-from killing@multiplay.co.uk) Received: from mail-wm0-x233.google.com (mail-wm0-x233.google.com [IPv6:2a00:1450:400c:c09::233]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority G2" (verified OK)) by mx1.freebsd.org (Postfix) with ESMTPS id E057D18E7 for ; Wed, 18 May 2016 08:02:00 +0000 (UTC) (envelope-from killing@multiplay.co.uk) Received: by mail-wm0-x233.google.com with SMTP id r12so21754004wme.0 for ; Wed, 18 May 2016 01:02:00 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=multiplay-co-uk.20150623.gappssmtp.com; s=20150623; h=subject:to:references:from:message-id:date:user-agent:mime-version :in-reply-to; bh=26hb0lp5WcDOWlKZd+27ZAvGaHKfswna5DuPsFPGHko=; b=q0Jxo0h4mtYVNfQr2pPmuc7vw+/biv32ehLgHGJHz+5+FdN79JW5N2DeodqKdtaU72 OX8pHXRG8uNYz9BYohtyzKxSLir+pPS2rLZtbG/LsCjJUSROY34a2wCmO7u3NVw9yLcn DCMArqjco6bhVYKen3A0AWaiMThJV/BFbm1qTuMPMI92rZNOjArmyjfD3NyluL+p/bpk 8XnaVvtsvmtgoD82BCJmTwyMi2qvFZcB0ylwWBoIbhC9TvIok3R6v6ftdi6sz9Odckoa Vvpw9R1sD/C4sLK6kJlZ04pZPex8M1P5LPDGpIKD2RcWIBQriralpERP7fJc9X4Q5Np7 gxUw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:subject:to:references:from:message-id:date :user-agent:mime-version:in-reply-to; bh=26hb0lp5WcDOWlKZd+27ZAvGaHKfswna5DuPsFPGHko=; b=CbuUgs0GUHflFXCZH+19zhciUvQc+vBXHDr7hvxvJM5q2DwxpXd62jwwT/QBn+OIgB URxwwhzgFA/YmVlVqM93AUBL+gF7rGjW+NHfAv8wgx6Shy3hhlIA6hqW6HSgur4RdLEA sAfkJ2V3PV7t6MaAnsgF7xTT3r6KIaoYic9USLcHt+UNqIcE43l/PZ2LtdzdGDvWGe8S FljCAY1Yq0IkVujzGXQZwlNzlBxvYB1gODp0ZeAtgd2g/+q/Q8iZbdL6vJviMNmn6l1B erDTArXGlf4mTLeB2MkqnZxLX5DiOFdfyct6mZXhES1SYb+23fbJV5IteJnIw+vbreWr e1wA== X-Gm-Message-State: AOPr4FUVrakDx+YHxEtJqDTbk98DawlbRHJXG9Fo9r3+azF7od5IB2O+htPkgo6IOQodqec9 X-Received: by 10.28.111.14 with SMTP id k14mr6000521wmc.32.1463558518998; Wed, 18 May 2016 01:01:58 -0700 (PDT) Received: from [10.10.1.58] (liv3d.labs.multiplay.co.uk. [82.69.141.171]) by smtp.gmail.com with ESMTPSA id w9sm28180175wme.19.2016.05.18.01.01.57 for (version=TLSv1/SSLv3 cipher=OTHER); Wed, 18 May 2016 01:01:57 -0700 (PDT) Subject: Re: ZFS performance bottlenecks: CPU or RAM or anything else? 
To: freebsd-fs@freebsd.org References: <8441f4c0-f8d1-f540-b928-7ae60998ba8e@lexa.ru> <16e474da-6b20-2e51-9981-3c262eaff350@lexa.ru> <1e012e43-a49b-6923-3f0a-ee77a5c8fa70@lexa.ru> <86shxgsdzh.fsf@WorkBox.Home> From: Steven Hartland Message-ID: <39be913e-32a5-2120-fee5-4521b8b95d80@multiplay.co.uk> Date: Wed, 18 May 2016 09:02:03 +0100 User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:45.0) Gecko/20100101 Thunderbird/45.1.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit X-Content-Filtered-By: Mailman/MimeDel 2.1.22 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 18 May 2016 08:02:01 -0000 My comment was targeted under the assumption of random IOPs workload, which is typically the case, where each RAIDZ group (vdev) will give approximately a single drive performance. For a pretty definitive guide / answer see: http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/ There's also some useful practical test results here: https://calomel.org/zfs_raid_speed_capacity.html On 18/05/2016 05:21, Alex Tutubalin wrote: > On 5/18/2016 12:11 AM, Steven Hartland wrote: >> Raidz is limited essential limited to a single drive performance per >> dev for read and write while mirror is single drive performance for >> write its number of drives for read. Don't forget mirror is not >> limited to two it can be three, four or more; so if you need more >> read throughput you can add drives to the mirror. > > Do I understand it correctly: > > - single write of one large file (or singe local write to zvol shared > via iSCSI) will be local: single or only several metaslabs > > - for RAIDZ each disk will get only part of throughput > > - for mirror, each disk included in write will receive full data size > (and for single local write only limited number of disks to be > included in write) > > If so, raidz will have huge write performance benefit in my case: > single write of one large file. > > As for read speed, I hope to deal with it with large enough L2ARC on > SSDs. > > >> >> To increase raidz performance you need to add more vdevs. While this >> doesn't have to be double i.e. the same vdev config as the first it >> generally a good idea. > > Again, multiple vdevs will help for multiple parallel writes, but not > for single one? 
> > Alex Tutubalin > _______________________________________________ > freebsd-fs@freebsd.org mailing list > https://lists.freebsd.org/mailman/listinfo/freebsd-fs > To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org" From owner-freebsd-fs@freebsd.org Wed May 18 08:02:13 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id E88C2B408B3 for ; Wed, 18 May 2016 08:02:13 +0000 (UTC) (envelope-from jg@internetx.com) Received: from mx1.internetx.com (mx1.internetx.com [62.116.129.39]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 76A29199F for ; Wed, 18 May 2016 08:02:13 +0000 (UTC) (envelope-from jg@internetx.com) Received: from localhost (localhost [127.0.0.1]) by mx1.internetx.com (Postfix) with ESMTP id BD3F345FC0D8; Wed, 18 May 2016 10:02:04 +0200 (CEST) X-Virus-Scanned: InterNetX GmbH amavisd-new at ix-mailer.internetx.de Received: from mx1.internetx.com ([62.116.129.39]) by localhost (ix-mailer.internetx.de [127.0.0.1]) (amavisd-new, port 10024) with ESMTP id JYzL3Wdxyf6l; Wed, 18 May 2016 10:02:02 +0200 (CEST) Received: from [192.168.100.26] (pizza.internetx.de [62.116.129.3]) (using TLSv1 with cipher AES128-SHA (128/128 bits)) (No client certificate requested) by mx1.internetx.com (Postfix) with ESMTPSA id 343BA4C4C5E9; Wed, 18 May 2016 10:02:02 +0200 (CEST) Subject: Re: Best practice for high availability ZFS pool References: <5E69742D-D2E0-437F-B4A9-A71508C370F9@FreeBSD.org> <5DA13472-F575-4D3D-80B7-1BE371237CE5@getsomewhere.net> <8E674522-17F0-46AC-B494-F0053D87D2B0@pingpong.net> To: Joe Love Cc: freebsd-fs@freebsd.org Reply-To: jg@internetx.com From: InterNetX - Juergen Gotteswinter Message-ID: <361f80cb-c7e2-18f6-ad62-f6f91aa7c293@internetx.com> Date: Wed, 18 May 2016 10:02:00 +0200 User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:45.0) Gecko/20100101 Thunderbird/45.1.0 MIME-Version: 1.0 In-Reply-To: <8E674522-17F0-46AC-B494-F0053D87D2B0@pingpong.net> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 18 May 2016 08:02:14 -0000 Am 5/18/2016 um 9:53 AM schrieb Palle Girgensohn: > > >> 17 maj 2016 kl. 18:13 skrev Joe Love : >> >> >>> On May 16, 2016, at 5:08 AM, Palle Girgensohn wrote: >>> >>> Hi, >>> >>> We need to set up a ZFS pool with redundance. The main goal is high availability - uptime. >>> >>> I can see a few of paths to follow. >>> >>> 1. HAST + ZFS >>> >>> 2. Some sort of shared storage, two machines sharing a JBOD box. >>> >>> 3. ZFS replication (zfs snapshot + zfs send | ssh | zfs receive) >>> >>> 4. using something else than ZFS, even a different OS if required. >>> >>> My main concern with HAST+ZFS is performance. Google offer some insights here, I find mainly unsolved problems. Please share any success stories or other experiences. >>> >>> Shared storage still has a single point of failure, the JBOD box. Apart from that, is there even any support for the kind of storage PCI cards that support dual head for a storage box? I cannot find any. >>> >>> We are running with ZFS replication today, but it is just too slow for the amount of data. 
>>> >>> We prefer to keep ZFS as we already have a rather big (~30 TB) pool and also tools, scripts, backup all is using ZFS, but if there is no solution using ZFS, we're open to alternatives. Nexenta springs to mind, but I believe it is using shared storage for redundance, so it does have single points of failure? >>> >>> Any other suggestions? Please share your experience. :) >>> >>> Palle >> >> I don’t know if this falls into the realm of what you want, but BSDMag just released an issue with an article entitled “Adding ZFS to the FreeBSD dual-controller storage concept.” >> https://bsdmag.org/download/reusing_openbsd/ >> >> My understanding in this setup is that the only single point of failure for this model is the backplanes that the drives would connect to. Depending on your controller cards, this could be alleviated by simply using multiple drive shelves, and only using one drive/shelf as part of a vdev (then stripe or whatnot over your vdevs). >> >> It might not be what you’re after, as it’s basically two systems with their own controllers, with a shared set of drives. Some expansion from the virtual world to real physical systems will probably need additional variations. >> I think the TrueNAS system (with HA) is setup similar to this, only without the split between the drives being primarily handled by separate controllers, but someone with more in-depth knowledge would need to confirm/deny this. >> >> -Jo > > Hi, > > Do you know any specific controllers that work with dual head? > > Thanks., > Palle go for lsi sas2008 based hba > > > _______________________________________________ > freebsd-fs@freebsd.org mailing list > https://lists.freebsd.org/mailman/listinfo/freebsd-fs > To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org" > From owner-freebsd-fs@freebsd.org Wed May 18 08:28:11 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id C8D19B3F207 for ; Wed, 18 May 2016 08:28:11 +0000 (UTC) (envelope-from lexa@lexa.ru) Received: from mx3.lexa.ru (ns503534.ip-198-27-68.net [198.27.68.102]) by mx1.freebsd.org (Postfix) with ESMTP id AA3561D80 for ; Wed, 18 May 2016 08:28:11 +0000 (UTC) (envelope-from lexa@lexa.ru) Received: by mx3.lexa.ru (Postfix, from userid 66) id 7AD4C224A5C; Wed, 18 May 2016 04:28:10 -0400 (EDT) Received: from [193.124.130.166] (unknown [193.124.130.166]) by home-gw.lexa.ru (Postfix) with ESMTP id 2C0D61CAF for ; Wed, 18 May 2016 11:26:06 +0300 (MSK) Subject: Re: ZFS performance bottlenecks: CPU or RAM or anything else? 
To: freebsd-fs@freebsd.org References: <8441f4c0-f8d1-f540-b928-7ae60998ba8e@lexa.ru> <16e474da-6b20-2e51-9981-3c262eaff350@lexa.ru> <1e012e43-a49b-6923-3f0a-ee77a5c8fa70@lexa.ru> <86shxgsdzh.fsf@WorkBox.Home> <39be913e-32a5-2120-fee5-4521b8b95d80@multiplay.co.uk> From: Alex Tutubalin Message-ID: <411166e6-239f-0bf2-99df-e177f334270c@lexa.ru> Date: Wed, 18 May 2016 11:26:06 +0300 User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:45.0) Gecko/20100101 Thunderbird/45.1.0 MIME-Version: 1.0 In-Reply-To: <39be913e-32a5-2120-fee5-4521b8b95d80@multiplay.co.uk> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 18 May 2016 08:28:11 -0000

On 5/18/2016 11:02 AM, Steven Hartland wrote:
> My comment was targeted under the assumption of random IOPs workload,
> which is typically the case, where each RAIDZ group (vdev) will give
> approximately a single drive performance. For a pretty definitive
> guide / answer see:
> http://blog.delphix.com/matt/2014/06/06/zfs-stripe-width/

Thank you for the link.

In my workload (single write stream) the IOPs count is very low and disk write locality is good (each file is most likely to fit in a single metaslab), so bandwidth is not limited to single-drive bandwidth.

My current box (6x 7200rpm HDDs in raidz1) provides about 430 MB/s write bandwidth over an SMB link and about 500 MB/s for local writes. That is ~100 MB/s per spindle, close enough to what is expected. I hope I'll see 2x the bandwidth with 2x the spindle count if I do not hit another performance limiter. So, my initial question was 'is there any known raidz performance limiter, like CPU or RAM speed/latency'.

> There's also some useful practical test results here:
> https://calomel.org/zfs_raid_speed_capacity.html

I've already posted this link in my thread-starting message :)

And, yes, there is a very strange similarity in both read and write speed in the 6x and 10x SSD/raidz2 cases.

Unfortunately, this benchmark is not a real use case because of: "Since the disk cache can artificially inflate the results we choose to disable drive caches completely using Bonnie++ in synchronous test mode only."

Synchronous mode will result in double writes (ZIL, then data); without a separate ZIL device, the ZIL is written to the main pool. We do not know what will happen with real-life async writes on the same hardware.

Alex Tutubalin

From owner-freebsd-fs@freebsd.org Wed May 18 09:28:53 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id D071EB3D72B for ; Wed, 18 May 2016 09:28:53 +0000 (UTC) (envelope-from crest@rlwinm.de) Received: from smtp.rlwinm.de (smtp.rlwinm.de [148.251.233.239]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 8A2CD1BB4 for ; Wed, 18 May 2016 09:28:53 +0000 (UTC) (envelope-from crest@rlwinm.de) Received: from crest.local (unknown [87.253.189.132]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.rlwinm.de (Postfix) with ESMTPSA id AA7C86E14 for ; Wed, 18 May 2016 11:28:50 +0200 (CEST) Subject: Re: ZFS performance bottlenecks: CPU or RAM or anything else? 
To: freebsd-fs@freebsd.org References: <8441f4c0-f8d1-f540-b928-7ae60998ba8e@lexa.ru> From: Jan Bramkamp Message-ID: Date: Wed, 18 May 2016 11:28:49 +0200 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:45.0) Gecko/20100101 Thunderbird/45.1.0 MIME-Version: 1.0 In-Reply-To: <8441f4c0-f8d1-f540-b928-7ae60998ba8e@lexa.ru> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 18 May 2016 09:28:53 -0000 On 17/05/16 14:00, Alex Tutubalin wrote: > Hi, > > I'm new to the list, sorry if the subject was discussed earlier (for > many times), just point to archives.... > > I'm building new storage server for 'linear read/linear write' > performance with limited number of parallel data streams (load like > read/write multi-gigabyte photoshop files, or read many large raw photo > files). > Target is to saturate 10G link using SMB or iSCSI. > > Several years ago I've tested small zpool (5x3Tb 7200rpm drives in > RAIDZ) with different CPU/memory combos and have got these results for > linear write speed by big chunks: > > 440 Mb/sec with Core i3-2120/DDR3-1600 ram (2 channel) > 360 Mb/sec with core i7-920/DDR3-1333 (3 channel RAM) > 280 Mb/sec with Core 2Q Q9300 /DDR2-800 (2 channel) > > Mixed thoughts: i7-920 is fastest of the three, RAM linear access also > fastest, but beaten by i3-2120 with lower latency memory. > > Also, I've found this link: > https://calomel.org/zfs_raid_speed_capacity.html > For 6x SSD and 10x SSD in RAIDZ2, there is very similar read speed > (1.7Gb/sec) and very close in write speed (721/806 Mb/sec for 6/10 drives). > > Assuming HBA/PCIe performance to be very same for read and write > operations, write speed is not limited by HBA/bus... so it is limited by > what? CPU or RAM or ...? > > So, my question is 'what CPU/memory is optimal for ZFS performance'? > > In particular: > - DDR3 or DDR4 (twice the bandwidth) ? > - limited number of cores and high clock rate (e.g. i3-6xxxx) or many > cores/slower clock ? > > No plans to use compression or deduplication, only raidz2 with 8-10 HDD > spindles and 3-4-5 SSDs for L2ARC. Don't forget that you're not just benchmarking CPUs. You're measuring whole systems with different disk controllers, memory controllers, interrupt routing etc. For example the Core 2 CPU is limited by its old design putting the memory controllers into the northbridge. Maybe you can reduce some of the differences by using the same PCI-e SAS HBA in each system. 
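[Editor's illustration: since this thread compares whole systems on single-stream sequential writes, a crude probe like the following removes the SMB/iSCSI variables and measures just pool write throughput. This is a sketch only: the target path, block size, and total size are arbitrary placeholders, it is untested, and compression should be disabled on the dataset for the number to be meaningful.]

#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define BLKSZ	(1024 * 1024)	/* 1 MiB per write() */
#define NBLKS	4096		/* 4 GiB total */

int
main(void)
{
	struct timespec t0, t1;
	double secs;
	char *buf;
	int fd, i;

	if ((buf = malloc(BLKSZ)) == NULL)
		err(1, "malloc");
	memset(buf, 0xa5, BLKSZ);	/* non-zero data defeats zero elision */

	/* Placeholder path on the pool under test. */
	fd = open("/pool/testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd == -1)
		err(1, "open");

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < NBLKS; i++)
		if (write(fd, buf, BLKSZ) != BLKSZ)
			err(1, "write");
	if (fsync(fd) == -1)	/* charge the final flush to the run */
		err(1, "fsync");
	clock_gettime(CLOCK_MONOTONIC, &t1);

	secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("%.1f MB/s\n", (double)NBLKS * BLKSZ / 1e6 / secs);
	close(fd);
	return (0);
}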
From owner-freebsd-fs@freebsd.org Wed May 18 10:38:22 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 8925EB4026B for ; Wed, 18 May 2016 10:38:22 +0000 (UTC) (envelope-from girgen@pingpong.net) Received: from mail.pingpong.net (mail.pingpong.net [79.136.116.202]) by mx1.freebsd.org (Postfix) with ESMTP id 1D4231532 for ; Wed, 18 May 2016 10:38:21 +0000 (UTC) (envelope-from girgen@pingpong.net) Received: from [10.226.149.205] (80-254-69-13.dynamic.monzoon.net [80.254.69.13]) (using TLSv1 with cipher ECDHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by mail.pingpong.net (Postfix) with ESMTPSA id F199316C12; Wed, 18 May 2016 12:38:20 +0200 (CEST) Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (1.0) Subject: Re: Best practice for high availability ZFS pool From: Palle Girgensohn X-Mailer: iPhone Mail (13E238) In-Reply-To: <5127A334-0805-46B8-9CD9-FD8585CB84F3@chittenden.org> Date: Wed, 18 May 2016 12:38:20 +0200 Cc: freebsd-fs@freebsd.org Content-Transfer-Encoding: quoted-printable Message-Id: References: <5E69742D-D2E0-437F-B4A9-A71508C370F9@FreeBSD.org> <5DA13472-F575-4D3D-80B7-1BE371237CE5@getsomewhere.net> <8E674522-17F0-46AC-B494-F0053D87D2B0@pingpong.net> <5127A334-0805-46B8-9CD9-FD8585CB84F3@chittenden.org> To: Sean Chittenden X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 18 May 2016 10:38:22 -0000

> 18 maj 2016 kl. 09:58 skrev Sean Chittenden :
>
> https://www.freebsdfoundation.org/wp-content/uploads/2015/12/vol2_no4_groupon.pdf
>
> mps(4) was good to us. What’s your workload? -sc

Have to check details for peaks but average is around 0.8 MByte/s. Not much. It will grow.

>
> --
> Sean Chittenden
> sean@chittenden.org
>
>
>> On May 18, 2016, at 03:53 , Palle Girgensohn wrote:
>>
>>
>>> 17 maj 2016 kl. 18:13 skrev Joe Love :
>>>
>>>> On May 16, 2016, at 5:08 AM, Palle Girgensohn wrote:
>>>>
>>>> Hi,
>>>>
>>>> We need to set up a ZFS pool with redundance. The main goal is high availability - uptime.
>>>>
>>>> I can see a few of paths to follow.
>>>>
>>>> 1. HAST + ZFS
>>>>
>>>> 2. Some sort of shared storage, two machines sharing a JBOD box.
>>>>
>>>> 3. ZFS replication (zfs snapshot + zfs send | ssh | zfs receive)
>>>>
>>>> 4. using something else than ZFS, even a different OS if required.
>>>>
>>>> My main concern with HAST+ZFS is performance. Google offer some insights here, I find mainly unsolved problems. Please share any success stories or other experiences.
>>>>
>>>> Shared storage still has a single point of failure, the JBOD box. Apart from that, is there even any support for the kind of storage PCI cards that support dual head for a storage box? I cannot find any.
>>>>
>>>> We are running with ZFS replication today, but it is just too slow for the amount of data.
>>>>
>>>> We prefer to keep ZFS as we already have a rather big (~30 TB) pool and also tools, scripts, backup all is using ZFS, but if there is no solution using ZFS, we're open to alternatives. Nexenta springs to mind, but I believe it is using shared storage for redundance, so it does have single points of failure?
>>>>
>>>> Any other suggestions? Please share your experience.
:) >>>>=20 >>>> Palle >>>=20 >>> I don=E2=80=99t know if this falls into the realm of what you want, but B= SDMag just released an issue with an article entitled =E2=80=9CAdding ZFS to= the FreeBSD dual-controller storage concept.=E2=80=9D >>> https://bsdmag.org/download/reusing_openbsd/ >>>=20 >>> My understanding in this setup is that the only single point of failure f= or this model is the backplanes that the drives would connect to. Depending= on your controller cards, this could be alleviated by simply using multiple= drive shelves, and only using one drive/shelf as part of a vdev (then strip= e or whatnot over your vdevs). >>>=20 >>> It might not be what you=E2=80=99re after, as it=E2=80=99s basically two= systems with their own controllers, with a shared set of drives. Some expa= nsion from the virtual world to real physical systems will probably need add= itional variations. >>> I think the TrueNAS system (with HA) is setup similar to this, only with= out the split between the drives being primarily handled by separate control= lers, but someone with more in-depth knowledge would need to confirm/deny th= is. >>>=20 >>> -Jo >>=20 >> Hi, >>=20 >> Do you know any specific controllers that work with dual head? >>=20 >> Thanks., >> Palle >>=20 >>=20 >> _______________________________________________ >> freebsd-fs@freebsd.org mailing list >> https://lists.freebsd.org/mailman/listinfo/freebsd-fs >> To unsubscribe, send any mail to "freebsd-fs-unsubscribe@freebsd.org" >=20 From owner-freebsd-fs@freebsd.org Wed May 18 11:08:41 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id E12B9B40B8D for ; Wed, 18 May 2016 11:08:41 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from mailman.ysv.freebsd.org (mailman.ysv.freebsd.org [IPv6:2001:1900:2254:206a::50:5]) by mx1.freebsd.org (Postfix) with ESMTP id CA015189E for ; Wed, 18 May 2016 11:08:41 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: by mailman.ysv.freebsd.org (Postfix) id C9536B40B8C; Wed, 18 May 2016 11:08:41 +0000 (UTC) Delivered-To: fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id C6C93B40B8B for ; Wed, 18 May 2016 11:08:41 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 6A90A189D for ; Wed, 18 May 2016 11:08:41 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from tom.home (kib@localhost [127.0.0.1]) by kib.kiev.ua (8.15.2/8.15.2) with ESMTPS id u4IB8ZdA002871 (version=TLSv1 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO); Wed, 18 May 2016 14:08:35 +0300 (EEST) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.10.3 kib.kiev.ua u4IB8ZdA002871 Received: (from kostik@localhost) by tom.home (8.15.2/8.15.2/Submit) id u4IB8YXl002870; Wed, 18 May 2016 14:08:34 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Wed, 18 May 2016 14:08:34 +0300 From: Konstantin Belousov To: Bruce Evans Cc: fs@freebsd.org Subject: Re: fix for per-mount i/o counting in ffs Message-ID: <20160518110834.GJ89104@kib.kiev.ua> References: 
<20160517072104.I2137@besplex.bde.org> <20160517084241.GY89104@kib.kiev.ua> <20160518061040.D5948@besplex.bde.org> <20160518070252.F6121@besplex.bde.org> <20160517220055.GF89104@kib.kiev.ua> <20160518084931.T6534@besplex.bde.org> In-Reply-To: <20160518084931.T6534@besplex.bde.org>

On Wed, May 18, 2016 at 10:00:09AM +1000, Bruce Evans wrote:
> On Wed, 18 May 2016, Konstantin Belousov wrote:
> > VCHR check ensures that the devvp vnode is not reclaimed. I do not want
> > to remove the check and rely on the caller of ffs_mountfs() to always do
> > the right thing for it without unlocking devvp, this is too subtle.
>
> Surely the caller must lock devvp? Otherwise none of the uses of devvp
> can be trusted, and there are several others.
It must lock, but the interface of ffs_mountfs() would then require that there is no relock between the vn_isdisk() check and the call. I think I know how to make a good compromise there. I converted the check for VCHR into an assert.
> There is also ump->um_devvvp, but this seems to be unusable since it
> might go away.
Go away as in being reclaimed, yes. The vnode itself is there, since we keep a reference.
> > So using the devvp->v_rdev instead of the dev variable is not just a
> > style bug.
Might be.
> > devvp->v_bufobj.bo_ops = &ffs_ops;
> > if (devvp->v_type == VCHR)
>
> devvp must be still VCHR since this is now under the vnode lock, and we
> depend on dev remaining a character device for the disk described by
> devvp at the time of the vn_isdisk() check.
Unless relocked.
> > - devvp->v_rdev->si_mountpt = mp;
> > + dev->si_mountpt = mp;
> > + VOP_UNLOCK(devvp, 0);
>
> The unlocking could be a little earlier since dev is still for a disk even
> if devvp went away and you changed this to not used devvp->v_rdev.
>
> > fs = NULL;
> > sblockloc = 0;
>
> Unlocking and then using devvp sure looks like a race.
>
> You only needed to move the unlocking to fix. devvp->v_bufobj. How does
> that work? The write is now locked, but if devvp goes away, then don't
> we lose its bufobj?
The buffer queues are flushed, and the BO_DEAD flag is set. But the flag does very little.
> How does any use of ump->um_devvp work?
>
> I tried revoke(2) on the devvp of a mounted file system. This worked
> to give v_type = VBAD and v_rdev = NULL, but didn't crash. ffs_unmount()
> checked for the bad vnode, unlike most places, and failed to clear
> si_mountpt.
>
> Normal use doesn't have revokes, but if the vnode is reclaimed instead
> of just becoming bad, then worse things probably happen. I think vnode
> cache resizing gives very unstable storage so the pointer becomes very
> invalid. But even revoke followed by setting kern.numvnodes to 1 didn't
> crash (15 vnodes remained). So devvp must be referenced throughout.
> It seems to have reference count 2, since umounting reduced kern.numvnodes
> from 15 to 13. (It is surprising how much works with kern.maxvnodes=1.
> I was able to run revoke, sysctl and umount.)
> It is still a mystery that the VBAD vnode doesn't crash soon.

I believe that the bo_ops assignment is the reason why UFS mounts survive the reclamation of the devvp vnode. Take a look at ffs_geom_strategy(), which is the place where UFS I/O is tunneled directly into geom. It does not pass I/O requests through devfs. As a result, revocation does not change much except doing an unnecessary buf queue flush.

It might be telling to try the same experiment, as conducted in your next message, on msdosfs instead of UFS.

Below is the simplified patch.

diff --git a/sys/ufs/ffs/ffs_vfsops.c b/sys/ufs/ffs/ffs_vfsops.c
index 712fc21..412b000 100644
--- a/sys/ufs/ffs/ffs_vfsops.c
+++ b/sys/ufs/ffs/ffs_vfsops.c
@@ -764,6 +764,7 @@ ffs_mountfs(devvp, mp, td)
 	cred = td ? td->td_ucred : NOCRED;
 	ronly = (mp->mnt_flag & MNT_RDONLY) != 0;
 
+	KASSERT(devvp->v_type == VCHR, ("reclaimed devvp"));
 	dev = devvp->v_rdev;
 	dev_ref(dev);
 	DROP_GIANT();
@@ -771,17 +772,17 @@
 	error = g_vfs_open(devvp, &cp, "ffs", ronly ? 0 : 1);
 	g_topology_unlock();
 	PICKUP_GIANT();
-	VOP_UNLOCK(devvp, 0);
-	if (error)
+	if (error != 0) {
+		VOP_UNLOCK(devvp, 0);
 		goto out;
-	if (devvp->v_rdev->si_iosize_max != 0)
-		mp->mnt_iosize_max = devvp->v_rdev->si_iosize_max;
+	}
+	if (dev->si_iosize_max != 0)
+		mp->mnt_iosize_max = dev->si_iosize_max;
 	if (mp->mnt_iosize_max > MAXPHYS)
 		mp->mnt_iosize_max = MAXPHYS;
-
 	devvp->v_bufobj.bo_ops = &ffs_ops;
-	if (devvp->v_type == VCHR)
-		devvp->v_rdev->si_mountpt = mp;
+	dev->si_mountpt = mp;
+	VOP_UNLOCK(devvp, 0);
 
 	fs = NULL;
 	sblockloc = 0;
@@ -1083,8 +1084,7 @@
 out:
 	if (bp)
 		brelse(bp);
-	if (devvp->v_type == VCHR && devvp->v_rdev != NULL)
-		devvp->v_rdev->si_mountpt = NULL;
+	dev->si_mountpt = NULL;
 	if (cp != NULL) {
 		DROP_GIANT();
 		g_topology_lock();
@@ -1287,8 +1287,7 @@ ffs_unmount(mp, mntflags)
 		g_vfs_close(ump->um_cp);
 		g_topology_unlock();
 		PICKUP_GIANT();
-	if (ump->um_devvp->v_type == VCHR && ump->um_devvp->v_rdev != NULL)
-		ump->um_devvp->v_rdev->si_mountpt = NULL;
+	ump->um_dev->si_mountpt = NULL;
 	vrele(ump->um_devvp);
 	dev_rel(ump->um_dev);
 	mtx_destroy(UFS_MTX(ump));

From owner-freebsd-fs@freebsd.org Wed May 18 13:45:30 2016 Subject: Fwd: ZFS Encryption Implementation for Review To: freebsd-fs From: Andriy Gapon Message-ID: <1ea0d65f-fc7d-f472-ce0a-f3c74bf08d77@FreeBSD.org> Date: Wed, 18 May 2016 16:44:31 +0300

Just in case people overlooked this information in another thread here.

-------- Forwarded Message --------
Subject: [developer] ZFS Encryption Implementation for Review
Date: Tue, 17 May 2016 17:17:53 -0400
From: Thomas Caputi
To: developer@open-zfs.org

I have created an implementation for native encryption in ZFS. This implementation is currently available as a PR against ZoL (https://github.com/zfsonlinux/zfs/pull/4329). I would appreciate it if this PR could receive a review for consideration. For convenience, I have pasted the PR's description below.

Thanks,
Tom Caputi

Native encryption in zfsonlinux (see issue #494)

The change incorporates 3 major pieces:

The first is a port of the Illumos Crypto Framework to a Linux kernel module (found in module/icp). This is needed to do the actual encryption work. We cannot use the Linux kernel's built-in crypto API because it is only exported to GPL-licensed modules. Having the ICP also means the crypto code can run on any of the other kernels under OpenZFS. I ended up porting over most of the internals of the framework, which means that porting over other API calls (if we need them) should be fairly easy. Specifically, I have ported over the API functions related to encryption, digests, MACs, and crypto templates. The ICP is able to use assembly-accelerated encryption on amd64 machines and AES-NI instructions on Intel chips that support it. There are place-holder directories for similar assembly optimizations for other architectures (although they have not been written).

The second feature is a keystore that manages wrapping and encryption keys for encrypted datasets. It has feature parity with Solaris, but should be more predictable and consistent. It is fully integrated with the existing zfs create and zfs clone functions. It also exposes a new set of commands via zfs key for managing the keystore. For more info on the inconsistencies in Solaris see my comment (https://github.com/zfsonlinux/zfs/issues/494#issuecomment-178853634) on the issue page.

The keystore operates on a few rules: All wrapping keys are 32 bytes (256 bits), even for 128 and 192 bit encryption types. Encryption must be specified at dataset creation time. Specifying a keysource while creating a dataset causes the dataset to become the root of an encryption tree. All members of an encryption tree share the same wrapping key. Each dataset can have up to 1 keychain (if it is encrypted) that is not shared with anybody.

The last feature is the actual data and metadata encryption. All data in an encrypted dataset is stored encrypted on-disk. User-provided metadata is also encrypted, but metadata structures have been left plain so that scrubbing and resilvering still work without the keys loaded. Most of the design comes from this article (https://blogs.oracle.com/darren/entry/zfs_encryption_what_is_on). There are a few important distinctions, however. For instance, I store the encryption IV in the padding of blkptr_t instead of in its third DVA. I also have L2ARC encryption implemented, which Oracle did not have at the time.

Implementation details that should be looked at:

I created a new DMU_OT_* for keychain objects instead of using the DMU_OTN() macro. I did this mostly for the ability to register a name, which helped with debugging. The Keychain objects also seem like a core enough structure to warrant a new dedicated object type.

The crypto framework has some code bloat to it, particularly in the form of function stubs in header files that are never actually implemented. I figured it would be best to leave these in, in case more functions needed to be ported over.

The in-memory keystore is not the most efficient structure, since it zeros and frees encryption keys whenever they are not in use. This is intended as a security measure so that unwrapped keys do not exist in memory longer than they are needed.

Encrypting data going to disk requires creating a keychain_record_t during dsl_dataset_tryown(). I added a flag to this function for code that wishes to own the dataset but does not require encrypted data, such as the scrub functions. I did my best to confirm that all owners set this flag correctly, but someone should confirm them, just to be sure.

zfs send and zfs recv do not currently do anything special with regards to encryption. The format of the send file has not changed, and zfs send requires the keys to be loaded in order to work. At some point there should probably be a way to do encrypted sends.

I altered the prototype of lzc_create() and lzc_clone(). I understand that the purpose of libzfs_core is to have a stable API interacting with the ZFS ioctls. However, these functions need to accept wrapping keys separately from the rest of their parameters because they need to use the (new) hidden_args framework to support hiding arguments from the logs. Without this, the wrapping keys would get printed to the zpool history.

There is an extra local label that I needed to add to the top of the "global" function rijndael_key_setup_enc_intel() in module/icp/aes_intel.S. For some reason, I had to use the local label or the module would fail to link. If any assembly experts can tell me why this is required or a better way to fix it, I would appreciate it.

As part of the L2ARC changes, I added an 8-byte MAC field to l2arc_buf_hdr_t. I understand that there are a lot of reasons to keep this struct small since many of them may be allocated at once, but I do not seem to have another reasonable option here.

The icp is a kernel module that has a directory structure (unlike the other modules in zfs). There are a few reasons for this. First, the ICP has assembly code for different CPU architectures and I wanted to match the structure of libspl. The ICP also has headers that did not really need to belong in the global zfs headers, and so it made sense to make an includes directory for them locally. The directory structure also approximately mimics the structure of the Illumos Crypto Framework, which will be important for maintainability. As a result, I had to adjust the build systems to avoid flattening module directories. This shouldn't matter much, since the other modules were already flat.
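To make the keystore rules above concrete, here is a minimal C sketch of the relationships they imply. The names and layout below are hypothetical illustrations for this digest, not structures from the PR:

/*
 * Hypothetical illustration of the stated keystore rules; not code
 * from the PR.  One 256-bit wrapping key is shared by all members of
 * an encryption tree; each encrypted dataset has at most one keychain
 * whose entries hold the actual data-encryption keys, stored wrapped
 * (encrypted under the wrapping key) on disk.
 */
#include <stdint.h>
#include <stddef.h>

#define WRAPPING_KEY_LEN 32	/* always 32 bytes (256 bits), even for
				   AES-128/192 suites, per the rule above */

struct wrapping_key {		/* one per encryption tree root */
	uint8_t wk_data[WRAPPING_KEY_LEN];
};

struct keychain_entry {		/* per-dataset; at most one keychain */
	uint64_t ke_suite;	/* AES-128, AES-192 or AES-256 */
	size_t   ke_keylen;	/* 16, 24 or 32 bytes */
	uint8_t  ke_wrapped[WRAPPING_KEY_LEN];	/* data key, encrypted under
						   wk_data; toy fixed-size
						   buffer for illustration */
};

Even in this toy form the usual rationale for key wrapping is visible: changing the user's key material only requires re-wrapping the small keychain entries, not re-encrypting the bulk data they protect.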
-------------------------------------------
openzfs-developer archives: https://www.listbox.com/member/archive/274414/=now

From owner-freebsd-fs@freebsd.org Wed May 18 13:49:42 2016 Date: Wed, 18 May 2016 08:49:34 -0500 (CDT) From: Bob Friesenhahn To: Alex Tutubalin cc: "freebsd-fs@freebsd.org" Subject: Re: ZFS performance bottlenecks: CPU or RAM or anything else? References: <8441f4c0-f8d1-f540-b928-7ae60998ba8e@lexa.ru> <16e474da-6b20-2e51-9981-3c262eaff350@lexa.ru> <1e012e43-a49b-6923-3f0a-ee77a5c8fa70@lexa.ru> <86shxgsdzh.fsf@WorkBox.Home>

On Wed, 18 May 2016, Alex Tutubalin wrote:
>
> If so, raidz will have huge write performance benefit in my case: single
> write of one large file.

This is not proven in practice. With mirrors one typically has more vdevs, and each vdev gets a ZFS block-sized write in turn, using a round-robin algorithm (tuned for the available space on each vdev). Drive IOPS are saved since the blocks are not diced into smaller fragments (as raidzN requires). With raidz it is also necessary to pay the cost of the parity computations, which are not needed with mirroring.
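As a worked example of the fragmentation point above (illustrative numbers, not measurements from this thread), compare two six-disk pools writing one 128 KB record:

$128\,\mathrm{KB} \to 4 \times 32\,\mathrm{KB}\ \mathrm{data} + 2\ \mathrm{parity} = 6$ drive I/Os per record (6-disk raidz2)
$128\,\mathrm{KB} \to 2$ full-record writes to a single pair (pool of 3 mirror pairs)

Per record, the mirror pool issues a third of the drive I/Os and no parity math, with successive records rotating across the pairs; the trade-off is usable capacity (50% for the mirrors vs about 67% for the raidz2).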
Bob -- Bob Friesenhahn bfriesen@simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ From owner-freebsd-fs@freebsd.org Wed May 18 23:03:27 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id AA0A0B41494 for ; Wed, 18 May 2016 23:03:27 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mailman.ysv.freebsd.org (unknown [127.0.1.3]) by mx1.freebsd.org (Postfix) with ESMTP id 95EC516D8 for ; Wed, 18 May 2016 23:03:27 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: by mailman.ysv.freebsd.org (Postfix) id 953BCB41493; Wed, 18 May 2016 23:03:27 +0000 (UTC) Delivered-To: fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 94DC5B41492 for ; Wed, 18 May 2016 23:03:27 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail109.syd.optusnet.com.au (mail109.syd.optusnet.com.au [211.29.132.80]) by mx1.freebsd.org (Postfix) with ESMTP id 43CEE16D5 for ; Wed, 18 May 2016 23:03:26 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from c122-106-149-109.carlnfd1.nsw.optusnet.com.au (c122-106-149-109.carlnfd1.nsw.optusnet.com.au [122.106.149.109]) by mail109.syd.optusnet.com.au (Postfix) with ESMTPS id A41E6D6568E; Thu, 19 May 2016 09:03:21 +1000 (AEST) Date: Thu, 19 May 2016 09:03:19 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Konstantin Belousov cc: fs@freebsd.org Subject: Re: fix for per-mount i/o counting in ffs In-Reply-To: <20160518110834.GJ89104@kib.kiev.ua> Message-ID: <20160519065714.H1393@besplex.bde.org> References: <20160517072104.I2137@besplex.bde.org> <20160517084241.GY89104@kib.kiev.ua> <20160518061040.D5948@besplex.bde.org> <20160518070252.F6121@besplex.bde.org> <20160517220055.GF89104@kib.kiev.ua> <20160518084931.T6534@besplex.bde.org> <20160518110834.GJ89104@kib.kiev.ua> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.1 cv=TuMb/2jh c=1 sm=1 tr=0 a=R/f3m204ZbWUO/0rwPSMPw==:117 a=L9H7d07YOLsA:10 a=9cW_t1CCXrUA:10 a=s5jvgZ67dGcA:10 a=kj9zAlcOel0A:10 a=FF35ox4yeJKPj-cm5okA:9 a=ZiNstujT2j9NvtWX:21 a=EXmI02U5j4sSZ_lM:21 a=CjuIK1q_8ugA:10 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 18 May 2016 23:03:27 -0000 On Wed, 18 May 2016, Konstantin Belousov wrote: > On Wed, May 18, 2016 at 10:00:09AM +1000, Bruce Evans wrote: >> On Wed, 18 May 2016, Konstantin Belousov wrote: >>> VCHR check ensures that the devvp vnode is not reclaimed. I do not want >>> to remove the check and rely on the caller of ffs_mountfs() to always do >>> the right thing for it without unlocking devvp, this is too subtle. >> >> Surely the caller must lock devvp? Otherwise none of the uses of devvp >> can be trusted, and there are several others. > It must lock, but the interface of ffs_mountfs() would then require > that there is no relock between vn_isdisk() check and call. > > I think I know how to make a good compromise there. I converted the > check for VCHR into the assert. But it is very clear there is no re-lock, and that there must be no re-lock to work ("very clear" relative other complications). 
ffs_mountfs() is only called once and only exists to make the function more readable and debuggable (and auto-inlining it breaks debugging). Its nearby logic is: namei(); // lock vnode vn_isdisk(); // return if not if (MNT_UPDATE) fail_sometimes(); // locking problems -- see below else ffs_mountfs(); // clearly guaranteed still VCHR I found another locking problem for revoke. After mounting /i and revoking its device, mount -u fails. This is clearly because its rdev has gone away. This makes devvp->v_rdev != ump->um_devvp->v_rdev. The new devvp has the old rdev and the old devvp has a null rdev. This is not really a locking problem, but the correct behaviour. Most places just don't check. >> There is also ump->um_devvvp, but this seems to be unusable since it >> might go away. > Go away as in being reclaimed, yes. The vnode itself is there, since > we keep a reference. I think "reclaimed" is the wrong terminology. The reference prevents it being reclaimed by vnlrureclaim(), but doesn't prevent it being revoked (or vgone()d by a forced unmount of the devfs instance that it is on). The reference prevents it being reclaimed even if it is revoked. When it is revoked, some but apparently not all of the pointers in it are cleared or become garbage. None of them should be used, but some are. v_rdev is cleared and we are fairly careful not to follow it, but we depend on it being cleared and not garbage. Pointers that are not cleared include v_bufobj (apparently) and GEOM's hooks related to v_bufobj, and si_mountpt. si_mountpt is in the cdev and not in the vnode. >> So using the devvp->v_rdev instead of the dev variable is not just a >> style bug. > Might be. In some places. ump->um_devvp->v_rdev gives the old rdev, and devvp->v_rdev gives the current rdev provided devvp is locked. These can be compared to see if the old rdev was revoked. Otherwise, devvp->v_rdev is garbage and both ump->um_dev and ump->um_devvp are close to garbage -- they are both old and the only correct use of this is to check that they are still current, but then you have the current devvp (locked) and can use it instead. >> ... >> You only needed to move the unlocking to fix. devvp->v_bufobj. How does >> that work? The write is now locked, but if devvp goes away, then don't >> we lose its bufobj? > The buffer queues are flushed, and BO_DEAD flag is set. But the flag > does very little. > >> How does any use of ump->um_devvp work? The problems are similar to the ones with ttys that we are still working on. When the device is revoked, there may be many i/o's in progess on it. We don't want to block waiting for these, but they should be aborted before doing any more. But there are enough stale pointers to even allow new i/o's. Enough for tar cf of a complete small file system. >> I tried revoke(2) on the devvp of a mounted file system. This worked >> to give v_type = VBAD and v_rdev = NULL, but didn't crash. ffs_unmount() >> checked for the bad vnode, unlike most places, and failed to clear >> si_mountpt. >> >> Normal use doesn't have revokes, but if the vnode is reclaimed instead >> of just becoming bad, then worse things probably happen. I think vnode >> cache resizing gives very unstable storage so the pointer becomes very >> invalid. But even revoke followed by setting kern.numvnodes to 1 didn't >> crash (15 vnodes remained). So devvp must be referenced throughout. >> It seems to have reference count 2, since umounting reduced kern.numvnodes >> from 15 to 13. (It is surprising how much works with kern.maxvnodes=1. 
>> I was able to run revoke, sysctl and umount.) It is still a mystery
>> that the VBAD vnode doesn't crash soon.
>
> I believe that bo_ops assignment is the reason why UFS mounts survive the
> reclamation of the devvp vnode. Take a look at the ffs_geom_strategy(),
> which is the place where UFS io is tunneled directly into geom. It does
> not pass io requests through devfs. As result, revocation does not
> change much except doing unneccessary buf queue flush.
>
> It might be telling to try the same experiment, as conducted in your
> next message, on msdosfs instead of UFS.

Everything seems to work exactly the same for msdosfs. I retried:
- mount; tar; revoke; mount; tar # (2nd mount succeeds due to revoke)
- mount; tar; mount-another-devfs; mount-using-other-devfs; tar # (2nd mount succeeds due to separate devvp).
No crashes. I didn't risk any rw mounts.

> Below is the simplified patch.
>
> diff --git a/sys/ufs/ffs/ffs_vfsops.c b/sys/ufs/ffs/ffs_vfsops.c
> index 712fc21..412b000 100644
> --- a/sys/ufs/ffs/ffs_vfsops.c
> +++ b/sys/ufs/ffs/ffs_vfsops.c
> @@ -764,6 +764,7 @@ ffs_mountfs(devvp, mp, td)
> 	cred = td ? td->td_ucred : NOCRED;
> 	ronly = (mp->mnt_flag & MNT_RDONLY) != 0;
>
> +	KASSERT(devvp->v_type == VCHR, ("reclaimed devvp"));
> 	dev = devvp->v_rdev;
> 	dev_ref(dev);
> 	DROP_GIANT();

Not needed.

> @@ -771,17 +772,17 @@
> 	error = g_vfs_open(devvp, &cp, "ffs", ronly ? 0 : 1);
> 	g_topology_unlock();
> 	PICKUP_GIANT();
> -	VOP_UNLOCK(devvp, 0);
> -	if (error)
> +	if (error != 0) {
> +		VOP_UNLOCK(devvp, 0);
> 		goto out;

Needed.

> -	if (devvp->v_rdev->si_iosize_max != 0)
> -		mp->mnt_iosize_max = devvp->v_rdev->si_iosize_max;
> +	}
> +	if (dev->si_iosize_max != 0)
> +		mp->mnt_iosize_max = dev->si_iosize_max;
> 	if (mp->mnt_iosize_max > MAXPHYS)
> 		mp->mnt_iosize_max = MAXPHYS;
> -
> 	devvp->v_bufobj.bo_ops = &ffs_ops;
> -	if (devvp->v_type == VCHR)
> -		devvp->v_rdev->si_mountpt = mp;
> +	dev->si_mountpt = mp;
> +	VOP_UNLOCK(devvp, 0);
>
> 	fs = NULL;
> 	sblockloc = 0;

I would keep the unlock as early as possible. Just move the initialization of v_bufobj before it.

BTW, I don't like the fixup for > MAXPHYS. This is removed from all file systems in my version. dev->si_iosize_max should be clamped to MAXPHYS unless larger sizes work, and if larger sizes work then individual file systems don't know enough to kill using them. The check for si_iosize_max != 0 is bogus too, but not removed in my version. mp->mnt_iosize_max defaults to DFLTPHYS and the check avoids changing that, but if si_iosize_max remains at 0 then i/o won't actually work, and if some bug results in si_iosize_max being initialized later but early enough for some i/o to work, then the default of DFLTPHYS still won't work if it is larger than the driver size. g_dev_taste() actually defaults si_iosize_max to MAXPHYS, and I think GEOM hides the driver iosize_max from file systems, so I think si_iosize_max is actually always MAXPHYS here.

> @@ -1083,8 +1084,7 @@
> out:
> 	if (bp)
> 		brelse(bp);
> -	if (devvp->v_type == VCHR && devvp->v_rdev != NULL)
> -		devvp->v_rdev->si_mountpt = NULL;
> +	dev->si_mountpt = NULL;
> 	if (cp != NULL) {
> 		DROP_GIANT();
> 		g_topology_lock();

I think this is still racy, but the race is more harmless than most of the problems from revokes. I think the following can happen:
- after we unlock, another mount starts and clobbers our si_mountpt with a nonzero value. Then this clobbers the other mount's si_mountpt with a zero value.

The zero value is relatively harmless. It takes either a revoke, a separate devfs instance, or the old multiple-mount-allowing code for another mount to start.

The old code has a smaller race window:
- since the vnode is unlocked, it gives a null pointer panic if the v_rdev becomes null after it is tested to be non-null, or if it is still non-null then using it may clobber another mount's si_mountpt if the other mount's setting of si_mountpt races with us.
It takes a revoke to get the null pointer. Clobbering only takes a separate devfs instance or the old multiple-mount code.

> @@ -1287,8 +1287,7 @@ ffs_unmount(mp, mntflags)
> 		g_vfs_close(ump->um_cp);
> 		g_topology_unlock();
> 		PICKUP_GIANT();
> -	if (ump->um_devvp->v_type == VCHR && ump->um_devvp->v_rdev != NULL)
> -		ump->um_devvp->v_rdev->si_mountpt = NULL;
> +	ump->um_dev->si_mountpt = NULL;
> 	vrele(ump->um_devvp);
> 	dev_rel(ump->um_dev);
> 	mtx_destroy(UFS_MTX(ump));

This has the same problems as cleaning up after an error in mount.

I think the following works to prevent multiple mounts via all of the known buggy paths: early in every fsmount():

	dev = devvp->v_rdev;
	if (dev->si_mountpt != NULL) {
		cleanup();
		return (EBUSY);
	}
	dev->si_mountpt = mp;

This also prevents other mounts racing with us before we complete. Too bad if we fail but the other mount would have succeeded. In fsunmount(), move clearing si_mountpt to near the end. I hope si_mountpt is locked by the device reference and that this makes si_mountpt robust enough to use as an exclusive access flag.

GEOM's exclusive access counters somehow don't prevent the multiple mounts. I think they are too closely associated with the vnode via v_bufobj.

Bruce

From owner-freebsd-fs@freebsd.org Thu May 19 00:24:04 2016 Date: Thu, 19 May 2016 10:23:54 +1000 (EST) From: Bruce Evans To: Bruce Evans cc: Konstantin Belousov , fs@freebsd.org Subject: Re: fix for per-mount i/o counting in ffs In-Reply-To: <20160519065714.H1393@besplex.bde.org> Message-ID: <20160519094901.O1798@besplex.bde.org> References: <20160517072104.I2137@besplex.bde.org> <20160517084241.GY89104@kib.kiev.ua> <20160518061040.D5948@besplex.bde.org> <20160518070252.F6121@besplex.bde.org>
<20160517220055.GF89104@kib.kiev.ua> <20160518084931.T6534@besplex.bde.org> <20160518110834.GJ89104@kib.kiev.ua> <20160519065714.H1393@besplex.bde.org>

On Thu, 19 May 2016, Bruce Evans wrote:
> On Thu, 19 May 2016, Bruce Evans wrote:
>
>> ...
>> I think the following works to prevent multiple mounts via all of the
>> known buggy paths: early in every fsmount():
>>
>> 	dev = devvp->v_rdev;
>> 	if (dev->si_mountpt != NULL) {
>> 		cleanup();
>> 		return (EBUSY);
>> 	}
>> 	dev->si_mountpt = mp;
>>
>> This also prevents other mounts racing with us before we complete. Too
>> bad if we fail but the other mount would have succeeded. In fsunmount(),
>> move clearing si_mountpt to near the end. I hope si_mountpt is locked
>> by the device reference and that this makes si_mountpt robust enough to
>> use as an exclusive access flag.

Nah, the reference is not a lock. This needs dev_lock() or similar to be robust. struct cdev has no documented locking, but dev_lock() should work and is probably needed for writes. It is never used for accesses to si_mountpt now. Reads are safe enough since they are of the form 'mp = dev->si_mountpt; if (mp == NULL) dont_use_mp();'.

Bruce

From owner-freebsd-fs@freebsd.org Thu May 19 02:26:17 2016 Date: Thu, 19 May 2016 12:20:19 +1000 (EST) From: Bruce Evans To: Bruce Evans cc: Konstantin Belousov , fs@freebsd.org Subject: Re: fix for per-mount i/o counting in ffs In-Reply-To: <20160519094901.O1798@besplex.bde.org> Message-ID: <20160519120557.A2250@besplex.bde.org> References: <20160517072104.I2137@besplex.bde.org> <20160517084241.GY89104@kib.kiev.ua> <20160518061040.D5948@besplex.bde.org> <20160518070252.F6121@besplex.bde.org>
<20160518084931.T6534@besplex.bde.org> <20160518110834.GJ89104@kib.kiev.ua> <20160519065714.H1393@besplex.bde.org> <20160519094901.O1798@besplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.1 cv=TuMb/2jh c=1 sm=1 tr=0 a=R/f3m204ZbWUO/0rwPSMPw==:117 a=L9H7d07YOLsA:10 a=9cW_t1CCXrUA:10 a=s5jvgZ67dGcA:10 a=kj9zAlcOel0A:10 a=gX_jSoV2WuXo949cYewA:9 a=CjuIK1q_8ugA:10 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 19 May 2016 02:26:17 -0000 On Thu, 19 May 2016, Bruce Evans wrote: > On Thu, 19 May 2016, Bruce Evans wrote: > >> ... >> I think the following works to prevent multiple mounts via all of the >> known buggy paths: early in every fsmount(): Here is a lightly tested version: X Index: ffs_vfsops.c X =================================================================== X --- ffs_vfsops.c (revision 300160) X +++ ffs_vfsops.c (working copy) X @@ -771,18 +771,24 @@ X error = g_vfs_open(devvp, &cp, "ffs", ronly ? 0 : 1); X g_topology_unlock(); X PICKUP_GIANT(); X + /* XXX: v_bufobj is left set after errors. */ X + devvp->v_bufobj.bo_ops = &ffs_ops; X VOP_UNLOCK(devvp, 0); Since v_bufobj isn't cleaned after later errors, I didn't move the unlock to keep it clean for this error alone. X if (error) X - goto out; X - if (devvp->v_rdev->si_iosize_max != 0) X - mp->mnt_iosize_max = devvp->v_rdev->si_iosize_max; X + goto out1; X + dev_lock(); X + if (dev->si_mountpt != NULL) { X + dev_unlock(); X + error = EBUSY; X + goto out1; X + } X + dev->si_mountpt = mp; X + dev_unlock(); X + if (dev->si_iosize_max != 0) X + mp->mnt_iosize_max = dev->si_iosize_max; X if (mp->mnt_iosize_max > MAXPHYS) X mp->mnt_iosize_max = MAXPHYS; X X - devvp->v_bufobj.bo_ops = &ffs_ops; X - if (devvp->v_type == VCHR) X - devvp->v_rdev->si_mountpt = mp; X - X fs = NULL; X sblockloc = 0; X /* X @@ -1081,10 +1087,14 @@ X #endif /* !UFS_EXTATTR */ X return (0); X out: X + dev_lock(); X + if (dev->si_mountpt == NULL) X + panic("lost si_mountpt in mount"); X + dev->si_mountpt = NULL; X + dev_unlock(); I don't want the debugging panics or KASSERTs in the final version. Explicit locking the stores of NULL is probably not needed. dev_rel() will soon make these stores visible and other locking and ordering makes it very unlikely that they become visible too early. 
X +out1: X if (bp) X brelse(bp); X - if (devvp->v_type == VCHR && devvp->v_rdev != NULL) X - devvp->v_rdev->si_mountpt = NULL; X if (cp != NULL) { X DROP_GIANT(); X g_topology_lock(); X @@ -1287,8 +1297,11 @@ X g_vfs_close(ump->um_cp); X g_topology_unlock(); X PICKUP_GIANT(); X - if (ump->um_devvp->v_type == VCHR && ump->um_devvp->v_rdev != NULL) X - ump->um_devvp->v_rdev->si_mountpt = NULL; X + dev_lock(); X + if (ump->um_dev->si_mountpt == NULL) X + panic("lost si_mountpt in unmount"); X + ump->um_dev->si_mountpt = NULL; X + dev_unlock(); X vrele(ump->um_devvp); X dev_rel(ump->um_dev); X mtx_destroy(UFS_MTX(ump)); Bruce From owner-freebsd-fs@freebsd.org Thu May 19 10:41:36 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 103FCB42EE2 for ; Thu, 19 May 2016 10:41:36 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from mailman.ysv.freebsd.org (unknown [127.0.1.3]) by mx1.freebsd.org (Postfix) with ESMTP id ECACE1AB6 for ; Thu, 19 May 2016 10:41:35 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: by mailman.ysv.freebsd.org (Postfix) id EBFB7B42EE1; Thu, 19 May 2016 10:41:35 +0000 (UTC) Delivered-To: fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id EBA37B42EE0 for ; Thu, 19 May 2016 10:41:35 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 9730B1AB3 for ; Thu, 19 May 2016 10:41:35 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from tom.home (kib@localhost [127.0.0.1]) by kib.kiev.ua (8.15.2/8.15.2) with ESMTPS id u4JAfTdn056064 (version=TLSv1 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO); Thu, 19 May 2016 13:41:29 +0300 (EEST) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.10.3 kib.kiev.ua u4JAfTdn056064 Received: (from kostik@localhost) by tom.home (8.15.2/8.15.2/Submit) id u4JAfS7R056063; Thu, 19 May 2016 13:41:28 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Thu, 19 May 2016 13:41:28 +0300 From: Konstantin Belousov To: Bruce Evans Cc: fs@freebsd.org Subject: Re: fix for per-mount i/o counting in ffs Message-ID: <20160519104128.GN89104@kib.kiev.ua> References: <20160517072104.I2137@besplex.bde.org> <20160517084241.GY89104@kib.kiev.ua> <20160518061040.D5948@besplex.bde.org> <20160518070252.F6121@besplex.bde.org> <20160517220055.GF89104@kib.kiev.ua> <20160518084931.T6534@besplex.bde.org> <20160518110834.GJ89104@kib.kiev.ua> <20160519065714.H1393@besplex.bde.org> <20160519094901.O1798@besplex.bde.org> <20160519120557.A2250@besplex.bde.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160519120557.A2250@besplex.bde.org> User-Agent: Mutt/1.6.1 (2016-04-27) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no autolearn_force=no version=3.4.1 X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on tom.home X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , 
List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 19 May 2016 10:41:36 -0000

On Thu, May 19, 2016 at 12:20:19PM +1000, Bruce Evans wrote:
> On Thu, 19 May 2016, Bruce Evans wrote:
>
> > On Thu, 19 May 2016, Bruce Evans wrote:
> >
> >> ...
> >> I think the following works to prevent multiple mounts via all of the
> >> known buggy paths: early in every fsmount():
>
> Here is a lightly tested version:

There is no need to protect the si_mountpt with any locking; the field itself serves as a good enough lock, also preventing parallel mounts of the same device. I changed the assignment to atomic_cmpset, which is enough there. It is somewhat of a pity that this would reliably disable multiple ro mounts of the same volume.

There is no need to move the assignment of NULL to dev->si_mountpt later in ffs_unmount(); the moment where the assignment is performed is safe for another thread to start another mount.

I still want to keep devvp locked for long enough to cover the bufobj hacking, and I do not want to move the bufobj.bo_ops change before g_vfs_open() succeeds.

I also wanted to remove the GIANT dances, but this requires a geom patch, which I will mail separately.

diff --git a/sys/ufs/ffs/ffs_vfsops.c b/sys/ufs/ffs/ffs_vfsops.c
index 712fc21..21425f5 100644
--- a/sys/ufs/ffs/ffs_vfsops.c
+++ b/sys/ufs/ffs/ffs_vfsops.c
@@ -764,24 +764,29 @@ ffs_mountfs(devvp, mp, td)
 	cred = td ? td->td_ucred : NOCRED;
 	ronly = (mp->mnt_flag & MNT_RDONLY) != 0;
 
+	KASSERT(devvp->v_type == VCHR, ("reclaimed devvp"));
 	dev = devvp->v_rdev;
 	dev_ref(dev);
+	if (!atomic_cmpset_ptr(&dev->si_mountpt, 0, mp)) {
+		dev_rel(dev);
+		VOP_UNLOCK(devvp, 0);
+		return (EBUSY);
+	}
 	DROP_GIANT();
 	g_topology_lock();
 	error = g_vfs_open(devvp, &cp, "ffs", ronly ? 0 : 1);
 	g_topology_unlock();
 	PICKUP_GIANT();
-	VOP_UNLOCK(devvp, 0);
-	if (error)
+	if (error != 0) {
+		VOP_UNLOCK(devvp, 0);
 		goto out;
-	if (devvp->v_rdev->si_iosize_max != 0)
-		mp->mnt_iosize_max = devvp->v_rdev->si_iosize_max;
+	}
+	if (dev->si_iosize_max != 0)
+		mp->mnt_iosize_max = dev->si_iosize_max;
 	if (mp->mnt_iosize_max > MAXPHYS)
 		mp->mnt_iosize_max = MAXPHYS;
-
 	devvp->v_bufobj.bo_ops = &ffs_ops;
-	if (devvp->v_type == VCHR)
-		devvp->v_rdev->si_mountpt = mp;
+	VOP_UNLOCK(devvp, 0);
 
 	fs = NULL;
 	sblockloc = 0;
@@ -1083,8 +1088,6 @@ ffs_mountfs(devvp, mp, td)
 out:
 	if (bp)
 		brelse(bp);
-	if (devvp->v_type == VCHR && devvp->v_rdev != NULL)
-		devvp->v_rdev->si_mountpt = NULL;
 	if (cp != NULL) {
 		DROP_GIANT();
 		g_topology_lock();
@@ -1102,6 +1105,7 @@ out:
 		free(ump, M_UFSMNT);
 		mp->mnt_data = NULL;
 	}
+	dev->si_mountpt = NULL;
 	dev_rel(dev);
 	return (error);
 }
@@ -1287,8 +1291,7 @@ ffs_unmount(mp, mntflags)
 		g_vfs_close(ump->um_cp);
 		g_topology_unlock();
 		PICKUP_GIANT();
-	if (ump->um_devvp->v_type == VCHR && ump->um_devvp->v_rdev != NULL)
-		ump->um_devvp->v_rdev->si_mountpt = NULL;
+	ump->um_dev->si_mountpt = NULL;
 	vrele(ump->um_devvp);
 	dev_rel(ump->um_dev);
 	mtx_destroy(UFS_MTX(ump));

From owner-freebsd-fs@freebsd.org Thu May 19 23:27:49 2016 Received: by
mailman.ysv.freebsd.org (Postfix) id BAEA7B42381; Thu, 19 May 2016 23:27:49 +0000 (UTC) Delivered-To: fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id BA8A3B4237E for ; Thu, 19 May 2016 23:27:49 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail105.syd.optusnet.com.au (mail105.syd.optusnet.com.au [211.29.132.249]) by mx1.freebsd.org (Postfix) with ESMTP id 6BED21D42 for ; Thu, 19 May 2016 23:27:48 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from besplex.bde.org (c110-21-42-169.carlnfd1.nsw.optusnet.com.au [110.21.42.169]) by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id 967091049C54; Fri, 20 May 2016 09:27:39 +1000 (AEST) Date: Fri, 20 May 2016 09:27:38 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Konstantin Belousov cc: fs@freebsd.org Subject: Re: fix for per-mount i/o counting in ffs In-Reply-To: <20160519104128.GN89104@kib.kiev.ua> Message-ID: <20160520074427.W1151@besplex.bde.org> References: <20160517072104.I2137@besplex.bde.org> <20160517084241.GY89104@kib.kiev.ua> <20160518061040.D5948@besplex.bde.org> <20160518070252.F6121@besplex.bde.org> <20160517220055.GF89104@kib.kiev.ua> <20160518084931.T6534@besplex.bde.org> <20160518110834.GJ89104@kib.kiev.ua> <20160519065714.H1393@besplex.bde.org> <20160519094901.O1798@besplex.bde.org> <20160519120557.A2250@besplex.bde.org> <20160519104128.GN89104@kib.kiev.ua> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.1 cv=TuMb/2jh c=1 sm=1 tr=0 a=kDyANCGC9fy361NNEb9EQQ==:117 a=kDyANCGC9fy361NNEb9EQQ==:17 a=L9H7d07YOLsA:10 a=9cW_t1CCXrUA:10 a=s5jvgZ67dGcA:10 a=kj9zAlcOel0A:10 a=138hlgp83k9Wl2frAKkA:9 a=CjuIK1q_8ugA:10 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Thu, 19 May 2016 23:27:49 -0000 On Thu, 19 May 2016, Konstantin Belousov wrote: > On Thu, May 19, 2016 at 12:20:19PM +1000, Bruce Evans wrote: >> On Thu, 19 May 2016, Bruce Evans wrote: >> >>> On Thu, 19 May 2016, Bruce Evans wrote: >>> >>>> ... >>>> I think the following works to prevent multiple mounts via all of the >>>> known buggy paths: early in every fsmount(): >> >> Here is a lightly tested version: > > There is no need to protect the si_mountpt with any locking, the field > itself serves as a lock good enough, also preventing the parallel mounts > of the same devices. I changed the assignement to atomic_cmpset, which > is enough there. It is somewhat pity that this would reliably disable > multiple ro mounts of the same volume. I used a mutex since it is simpler. I think your version needs atomic ops for resetting the pointer, and maybe acquire/release too. It has locking that is very similar to a mutex. Mutexes use _mtx_obtain_lock = atomic_cmpset_acq_ptr and _mtx_release_lock = atomic_store_rel_ptr. This is already delicately weak -- full sequential consistency is not required. Then on x86, we (you) only recently finished optimizing atomic_store_rel so that it is as weak as possible (just a compiler membar before an ordinary store). Maybe even weaker locking is enough here, but this is too hard to understand. > There is no need to move assignment of NULL to dev->si_mountpt later > in ffs_unmount(), the moment where the assignment is performed is safe > for other thread to start another mount. 
I already noticed that it was almost as late as possible (could be moved 1 statement later) and not worth moving.

But to even reason about orders, you need atomic releases with acquire/release semantics. There are dev_ref() and dev_rel() calls nearby. The implementation of these probably has to, and in fact does, give some ordering. The details are too hard to understand. In ffs_unmount() I think it is actually ordering given by vrele() that makes things work:

> 	PICKUP_GIANT();
> -	if (ump->um_devvp->v_type == VCHR && ump->um_devvp->v_rdev != NULL)
> -		ump->um_devvp->v_rdev->si_mountpt = NULL;
> +	ump->um_dev->si_mountpt = NULL;
> 	vrele(ump->um_devvp);
> 	dev_rel(ump->um_dev);

We want the store to si_mountpt to become visible before the vnode is unlocked. Otherwise, a new mount can lock the vnode and fail with EBUSY because it sees si_mountpt != NULL. We have to know implementation details of vrele() to know that this happens.

> I still want to keep devvp locked for long enough to cover the bufobj
> hacking, and I do not want to move bufobj.bo_ops change before
> g_vfs_open() succeed.

I didn't move it before g_vfs_open(), but before VOP_UNLOCK(). I think v_bufobj is never cleared, but garbage in it is harmless except in the multiple-mounts case, which is now disallowed.

> I also wanted to remove GIANT dances, but this requires geom patch,
> which I will mail separately.

OK. I saw the other mail.

> diff --git a/sys/ufs/ffs/ffs_vfsops.c b/sys/ufs/ffs/ffs_vfsops.c
> index 712fc21..21425f5 100644
> --- a/sys/ufs/ffs/ffs_vfsops.c
> +++ b/sys/ufs/ffs/ffs_vfsops.c
> @@ -764,24 +764,29 @@ ffs_mountfs(devvp, mp, td)
> 	cred = td ? td->td_ucred : NOCRED;
> 	ronly = (mp->mnt_flag & MNT_RDONLY) != 0;
>
> +	KASSERT(devvp->v_type == VCHR, ("reclaimed devvp"));
> 	dev = devvp->v_rdev;
> 	dev_ref(dev);
> +	if (!atomic_cmpset_ptr(&dev->si_mountpt, 0, mp)) {
> +		dev_rel(dev);
> +		VOP_UNLOCK(devvp, 0);
> +		return (EBUSY);
> +	}

This is cleaner and safer than my version.

> 	DROP_GIANT();
> 	g_topology_lock();
> 	error = g_vfs_open(devvp, &cp, "ffs", ronly ? 0 : 1);

g_vfs_open() already sets devvp->v_bufobj.bo_ops to g_vfs_bufops unless it fails. This clobbered our setting in the buggy multiple-mount case. But with multiple mounts not allowed, this cleans up any garbage in v_bufobj.

g_vfs_open() has 2 failures for non-exclusive access. It starts by checking v_bufobj.bo_private == devvp (this is after translating its pointers to the ones passed here). This is avg's fix for the multiple-mounts problem (r206130). It doesn't work in all cases. I think this is unnecessary now.

Later, g_vfs_open() does a g_access() check for exclusive-enough access. This is supposed to allow multiple mounts at least when all are ro. I thought that avg modified this, but he actually did something different. I think this check only failed in buggy cases where multiple mounts were allowed. Our changes should make it never fail. It still returns the wrong errno (some general one returned by g_access() instead of the one documented for mount() -- this is EBUSY).

> 	g_topology_unlock();
> 	PICKUP_GIANT();
> -	VOP_UNLOCK(devvp, 0);

I don't like moving this below devvp accesses. It locks devvp, not dev.

> -	if (error)
> +	if (error != 0) {
> +		VOP_UNLOCK(devvp, 0);
> 		goto out;
> -	if (devvp->v_rdev->si_iosize_max != 0)
> -		mp->mnt_iosize_max = devvp->v_rdev->si_iosize_max;
> +	}
> +	if (dev->si_iosize_max != 0)
> +		mp->mnt_iosize_max = dev->si_iosize_max;

dev->si_iosize_max is locked by its undocumented lifetime. It is invariant since some previous time.

> 	if (mp->mnt_iosize_max > MAXPHYS)
> 		mp->mnt_iosize_max = MAXPHYS;
> -
> 	devvp->v_bufobj.bo_ops = &ffs_ops;

This needs to be before the vnode unlock, of course. I don't like the complication to avoid setting this if g_vfs_open() fails, but this at least makes it obvious that we don't set it to garbage when g_vfs_open() fails. In other error cases, and even after unmount, I think v_bufobj is left as garbage.

I now see another cleanup: don't goto out when g_vfs_open() fails. This depends on it setting cp to NULL and leaving nothing to clean when it fails. It has no man page, and this detail is documented in its source code.

> -	if (devvp->v_type == VCHR)
> -		devvp->v_rdev->si_mountpt = mp;
> +	VOP_UNLOCK(devvp, 0);
>
> 	fs = NULL;
> 	sblockloc = 0;
> @@ -1083,8 +1088,6 @@ ffs_mountfs(devvp, mp, td)
> out:
> 	if (bp)
> 		brelse(bp);
> -	if (devvp->v_type == VCHR && devvp->v_rdev != NULL)
> -		devvp->v_rdev->si_mountpt = NULL;
> 	if (cp != NULL) {
> 		DROP_GIANT();
> 		g_topology_lock();
> @@ -1102,6 +1105,7 @@ out:
> 		free(ump, M_UFSMNT);
> 		mp->mnt_data = NULL;
> 	}
> +	dev->si_mountpt = NULL;

This should remain before the vnode unlock. Otherwise a new mount can fail unnecessarily.

> 	dev_rel(dev);
> 	return (error);
> }
> @@ -1287,8 +1291,7 @@ ffs_unmount(mp, mntflags)
> 		g_vfs_close(ump->um_cp);
> 		g_topology_unlock();
> 		PICKUP_GIANT();
> -	if (ump->um_devvp->v_type == VCHR && ump->um_devvp->v_rdev != NULL)
> -		ump->um_devvp->v_rdev->si_mountpt = NULL;
> +	ump->um_dev->si_mountpt = NULL;
> 	vrele(ump->um_devvp);
> 	dev_rel(ump->um_dev);

This order is better for avoiding unnecessary failure for new mounts, but now I'm not sure if it is right. Anyway, it matters less to get unnecessary failures for a new mount after a long-lived old mount than after a failed mount, so the cleanup shouldn't be stricter than here.
> mtx_destroy(UFS_MTX(ump)); > Bruce From owner-freebsd-fs@freebsd.org Fri May 20 00:27:12 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 5DF15B415A1 for ; Fri, 20 May 2016 00:27:12 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mailman.ysv.freebsd.org (mailman.ysv.freebsd.org [IPv6:2001:1900:2254:206a::50:5]) by mx1.freebsd.org (Postfix) with ESMTP id 48E3517E6 for ; Fri, 20 May 2016 00:27:12 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: by mailman.ysv.freebsd.org (Postfix) id 482FCB4159F; Fri, 20 May 2016 00:27:12 +0000 (UTC) Delivered-To: fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 47D62B4159E for ; Fri, 20 May 2016 00:27:12 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail109.syd.optusnet.com.au (mail109.syd.optusnet.com.au [211.29.132.80]) by mx1.freebsd.org (Postfix) with ESMTP id F33AD17E5 for ; Fri, 20 May 2016 00:27:11 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from besplex.bde.org (c110-21-42-169.carlnfd1.nsw.optusnet.com.au [110.21.42.169]) by mail109.syd.optusnet.com.au (Postfix) with ESMTPS id 36948D6F26C; Fri, 20 May 2016 10:11:39 +1000 (AEST) Date: Fri, 20 May 2016 10:11:38 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Bruce Evans cc: Konstantin Belousov , fs@freebsd.org Subject: Re: fix for per-mount i/o counting in ffs In-Reply-To: <20160520074427.W1151@besplex.bde.org> Message-ID: <20160520095504.X1527@besplex.bde.org> References: <20160517072104.I2137@besplex.bde.org> <20160517084241.GY89104@kib.kiev.ua> <20160518061040.D5948@besplex.bde.org> <20160518070252.F6121@besplex.bde.org> <20160517220055.GF89104@kib.kiev.ua> <20160518084931.T6534@besplex.bde.org> <20160518110834.GJ89104@kib.kiev.ua> <20160519065714.H1393@besplex.bde.org> <20160519094901.O1798@besplex.bde.org> <20160519120557.A2250@besplex.bde.org> <20160519104128.GN89104@kib.kiev.ua> <20160520074427.W1151@besplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.1 cv=c+ZWOkJl c=1 sm=1 tr=0 a=kDyANCGC9fy361NNEb9EQQ==:117 a=kDyANCGC9fy361NNEb9EQQ==:17 a=L9H7d07YOLsA:10 a=9cW_t1CCXrUA:10 a=s5jvgZ67dGcA:10 a=kj9zAlcOel0A:10 a=_iGiZFj0rCtMW9D2ktwA:9 a=CjuIK1q_8ugA:10 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 20 May 2016 00:27:12 -0000 PS: On Fri, 20 May 2016, Bruce Evans wrote: > On Thu, 19 May 2016, Konstantin Belousov wrote: >> diff --git a/sys/ufs/ffs/ffs_vfsops.c b/sys/ufs/ffs/ffs_vfsops.c >> index 712fc21..21425f5 100644 >> --- a/sys/ufs/ffs/ffs_vfsops.c >> +++ b/sys/ufs/ffs/ffs_vfsops.c >> @@ -764,24 +764,29 @@ ffs_mountfs(devvp, mp, td) >> cred = td ? td->td_ucred : NOCRED; >> ronly = (mp->mnt_flag & MNT_RDONLY) != 0; >> >> + KASSERT(devvp->v_type == VCHR, ("reclaimed devvp")); I still don't like this. The source code tends to fill up with assertions (and comments) about simple things. >> dev = devvp->v_rdev; >> dev_ref(dev); >> + if (!atomic_cmpset_ptr(&dev->si_mountpt, 0, mp)) { I used != 0. 
>> @@ -1083,8 +1088,6 @@ ffs_mountfs(devvp, mp, td) >> out: >> if (bp) >> brelse(bp); >> - if (devvp->v_type == VCHR && devvp->v_rdev != NULL) >> - devvp->v_rdev->si_mountpt = NULL; >> if (cp != NULL) { >> DROP_GIANT(); >> g_topology_lock(); >> @@ -1102,6 +1105,7 @@ out: >> free(ump, M_UFSMNT); >> mp->mnt_data = NULL; >> } >> + dev->si_mountpt = NULL; > > This should remain before the vnode unlock. Otherwise a new mount can > fail unnecessarily. > >> dev_rel(dev); >> return (error); >> } Oops. The vnode lock is not held here, so the order in ffs_mount() cannot be duplicated. I moved the resetting of si_mountpt down to here too. Not locking here and elsewhere makes the locking for v_bufobj even more obscure. v_bufobj is a whole struct living in the vnode. It has no locking annotation, but has a style bug (a stray '*') where its locking annotation should be. g_vfs_open() sets sc->sc_bo to &vp->v_bufobj and I think most uses of this don't lock the vnode or check if has been revoked. Perhaps ro accesses are OK (revoke() must not clean v_bufobj). Cleaning v_bufobj on mount failure without the vnode lock would be a bug. I think it is just not cleaned or used until the next mount changes it. Bruce From owner-freebsd-fs@freebsd.org Fri May 20 03:22:22 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 2FA42B43A04 for ; Fri, 20 May 2016 03:22:22 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mailman.ysv.freebsd.org (unknown [127.0.1.3]) by mx1.freebsd.org (Postfix) with ESMTP id 1A74E10B3 for ; Fri, 20 May 2016 03:22:22 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: by mailman.ysv.freebsd.org (Postfix) id 19AEBB43A03; Fri, 20 May 2016 03:22:22 +0000 (UTC) Delivered-To: fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 170E3B43A02 for ; Fri, 20 May 2016 03:22:22 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail104.syd.optusnet.com.au (mail104.syd.optusnet.com.au [211.29.132.246]) by mx1.freebsd.org (Postfix) with ESMTP id B6B7110B2 for ; Fri, 20 May 2016 03:22:21 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from besplex.bde.org (c110-21-42-169.carlnfd1.nsw.optusnet.com.au [110.21.42.169]) by mail104.syd.optusnet.com.au (Postfix) with ESMTPS id 9587242AB98; Fri, 20 May 2016 13:22:13 +1000 (AEST) Date: Fri, 20 May 2016 13:22:09 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Bruce Evans cc: Konstantin Belousov , fs@freebsd.org Subject: Re: fix for per-mount i/o counting in ffs In-Reply-To: <20160520095504.X1527@besplex.bde.org> Message-ID: <20160520120927.V2190@besplex.bde.org> References: <20160517072104.I2137@besplex.bde.org> <20160517084241.GY89104@kib.kiev.ua> <20160518061040.D5948@besplex.bde.org> <20160518070252.F6121@besplex.bde.org> <20160517220055.GF89104@kib.kiev.ua> <20160518084931.T6534@besplex.bde.org> <20160518110834.GJ89104@kib.kiev.ua> <20160519065714.H1393@besplex.bde.org> <20160519094901.O1798@besplex.bde.org> <20160519120557.A2250@besplex.bde.org> <20160519104128.GN89104@kib.kiev.ua> <20160520074427.W1151@besplex.bde.org> <20160520095504.X1527@besplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.1 cv=TuMb/2jh c=1 sm=1 tr=0 a=kDyANCGC9fy361NNEb9EQQ==:117 
a=kDyANCGC9fy361NNEb9EQQ==:17 a=L9H7d07YOLsA:10 a=9cW_t1CCXrUA:10 a=s5jvgZ67dGcA:10 a=kj9zAlcOel0A:10 a=QWTFWtZIVJHxjm3Yne0A:9 a=CjuIK1q_8ugA:10 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 20 May 2016 03:22:22 -0000

PS2:

On Fri, 20 May 2016, Bruce Evans wrote:

> PS:
>
> On Fri, 20 May 2016, Bruce Evans wrote:
>
>> On Thu, 19 May 2016, Konstantin Belousov wrote:
>>> diff --git a/sys/ufs/ffs/ffs_vfsops.c b/sys/ufs/ffs/ffs_vfsops.c
>>> index 712fc21..21425f5 100644
>>> --- a/sys/ufs/ffs/ffs_vfsops.c
>>> +++ b/sys/ufs/ffs/ffs_vfsops.c
>>> @@ -764,24 +764,29 @@ ffs_mountfs(devvp, mp, td)
>>> 	cred = td ? td->td_ucred : NOCRED;
>>> 	ronly = (mp->mnt_flag & MNT_RDONLY) != 0;
>>>
>>> +	KASSERT(devvp->v_type == VCHR, ("reclaimed devvp"));
>
> I still don't like this.  The source code tends to fill up with
> assertions (and comments) about simple things.
>
>>> 	dev = devvp->v_rdev;
>>> 	dev_ref(dev);
>>> +	if (!atomic_cmpset_ptr(&dev->si_mountpt, 0, mp)) {
>
> I used != 0.

All file systems need this of course.  zfs doesn't use g_vfs_open(), so
how can it possibly work to give exclusive access to the device in
contention with other mount operations?  I think it doesn't even try.

Old code used vfs_mountedon() here.  vfs_mountedon() was just the above
cmp (but not set) in an extern function, with no obvious locking and a
differently bad name for si_mountpt (it was si_mountpoint; the correct
name is si_mp).  The vnode should be locked, but this was only enough if
the old aliasing code gave a unique vnode.

Old code also returned EBUSY if vcount(devvp) > 1 && devvp != rootvp.
Here rootvp is special to support some old hack involving abusing the
swap device for miniroot.  This was supposed to have been replaced by
g_access() checks in g_vfs_open(), but those aren't exclusive enough.
There is another check in g_vfs_open(), but that isn't exclusive enough
either, so we are trying to fix it now.

zfs_mount() seems to have no exclusivity check at all, except in the
illumos case it has the old vcount() check with the rootvp hack (spelled
differently as (v_flag & VROOT)).  zfs might support multiple mounts,
but it can only do that for itself, and the vcount() check normally
prevents this in the illumos case.

vfs_mountedon() is as good an interface as any for checking for
exclusive access in a shared way (except it probably can't support
multiple mounts like g_vfs_open() is supposed to).  It was in
4.4BSD-Lite.  4.4BSD-Lite doesn't have si_mountp[oin]t.  It used the
old alias stuff, which is relatively easy to understand there.  It
searches the list of aliases and skips ones whose type differs (these
are presumably revoked ones).  The aliases are vnodes with a common
rdev.  Each vnode has a V_MOUNTEDON flag.  I wonder if current bugs
affected this too -- after revoke, the device is still mounted but its
vnode is too messed up to show this.

vfs_mountedon() in FreeBSD-3 is similar to that in 4.4BSD-Lite.  In
FreeBSD-4, it is an in-between version that looks broken since it
doesn't have the alias loop; it depends on vp->v_specmountpoint being
the same for all aliases even with only 1 of the aliases locked.
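The difference between the old vfs_mountedon()-style check and a
compare-and-set claim, as a sketch (illustrative userspace C11, not the
historical kernel code; locking and vnode aliasing are elided):

	#include <errno.h>
	#include <stdatomic.h>
	#include <stdint.h>

	struct mount;
	struct cdev {
		_Atomic(uintptr_t) si_mountpt;
	};

	/*
	 * vfs_mountedon() style: compare only.  Without further
	 * locking, two racing mounts can both see 0 here and both
	 * proceed -- the multiple-mount hole being closed.
	 */
	static int
	mountedon_check(struct cdev *dev)
	{
		return (atomic_load(&dev->si_mountpt) != 0 ? EBUSY : 0);
	}

	/*
	 * cmpset style: the check and the claim are a single atomic
	 * step, so at most one of two racing mounts can win.
	 */
	static int
	cmpset_claim(struct cdev *dev, struct mount *mp)
	{
		uintptr_t zero = 0;

		return (atomic_compare_exchange_strong(&dev->si_mountpt,
		    &zero, (uintptr_t)mp) ? 0 : EBUSY);
	}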
Bruce From owner-freebsd-fs@freebsd.org Fri May 20 08:01:31 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id E3A67B43D0F for ; Fri, 20 May 2016 08:01:31 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from kenobi.freebsd.org (kenobi.freebsd.org [IPv6:2001:1900:2254:206a::16:76]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id D406D1E3E for ; Fri, 20 May 2016 08:01:31 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from bugs.freebsd.org ([127.0.1.118]) by kenobi.freebsd.org (8.15.2/8.15.2) with ESMTP id u4K81Vsk000577 for ; Fri, 20 May 2016 08:01:31 GMT (envelope-from bugzilla-noreply@freebsd.org) From: bugzilla-noreply@freebsd.org To: freebsd-fs@FreeBSD.org Subject: [Bug 209580] ZFS and geli broken with INVARIANTS enabled Date: Fri, 20 May 2016 08:01:31 +0000 X-Bugzilla-Reason: AssignedTo X-Bugzilla-Type: changed X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: Base System X-Bugzilla-Component: kern X-Bugzilla-Version: 10.3-RELEASE X-Bugzilla-Keywords: X-Bugzilla-Severity: Affects Many People X-Bugzilla-Who: linimon@FreeBSD.org X-Bugzilla-Status: New X-Bugzilla-Resolution: X-Bugzilla-Priority: --- X-Bugzilla-Assigned-To: freebsd-fs@FreeBSD.org X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: assigned_to Message-ID: In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Bugzilla-URL: https://bugs.freebsd.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 20 May 2016 08:01:32 -0000 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=3D209580 Mark Linimon changed: What |Removed |Added ---------------------------------------------------------------------------- Assignee|freebsd-bugs@FreeBSD.org |freebsd-fs@FreeBSD.org --=20 You are receiving this mail because: You are the assignee for the bug.= From owner-freebsd-fs@freebsd.org Fri May 20 08:02:50 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 2616DB43E2E for ; Fri, 20 May 2016 08:02:50 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from kenobi.freebsd.org (kenobi.freebsd.org [IPv6:2001:1900:2254:206a::16:76]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 1700010B5 for ; Fri, 20 May 2016 08:02:50 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from bugs.freebsd.org ([127.0.1.118]) by kenobi.freebsd.org (8.15.2/8.15.2) with ESMTP id u4K82nLL020453 for ; Fri, 20 May 2016 08:02:49 GMT (envelope-from bugzilla-noreply@freebsd.org) From: bugzilla-noreply@freebsd.org To: freebsd-fs@FreeBSD.org Subject: [Bug 209571] ZFS and NVMe performing poorly. 
TRIM requests stall I/O activity Date: Fri, 20 May 2016 08:02:50 +0000 X-Bugzilla-Reason: AssignedTo X-Bugzilla-Type: changed X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: Base System X-Bugzilla-Component: kern X-Bugzilla-Version: 10.3-RELEASE X-Bugzilla-Keywords: X-Bugzilla-Severity: Affects Many People X-Bugzilla-Who: linimon@FreeBSD.org X-Bugzilla-Status: New X-Bugzilla-Resolution: X-Bugzilla-Priority: --- X-Bugzilla-Assigned-To: freebsd-fs@FreeBSD.org X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: assigned_to Message-ID: In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Bugzilla-URL: https://bugs.freebsd.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 20 May 2016 08:02:50 -0000 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=3D209571 Mark Linimon changed: What |Removed |Added ---------------------------------------------------------------------------- Assignee|freebsd-bugs@FreeBSD.org |freebsd-fs@FreeBSD.org --=20 You are receiving this mail because: You are the assignee for the bug.= From owner-freebsd-fs@freebsd.org Fri May 20 08:07:28 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id E7E54B420CF for ; Fri, 20 May 2016 08:07:28 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from kenobi.freebsd.org (kenobi.freebsd.org [IPv6:2001:1900:2254:206a::16:76]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id D8B3A13B3 for ; Fri, 20 May 2016 08:07:28 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from bugs.freebsd.org ([127.0.1.118]) by kenobi.freebsd.org (8.15.2/8.15.2) with ESMTP id u4K87SSP033592 for ; Fri, 20 May 2016 08:07:28 GMT (envelope-from bugzilla-noreply@freebsd.org) From: bugzilla-noreply@freebsd.org To: freebsd-fs@FreeBSD.org Subject: [Bug 209508] zfs import assertion failed in avl_add() Date: Fri, 20 May 2016 08:07:29 +0000 X-Bugzilla-Reason: AssignedTo X-Bugzilla-Type: changed X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: Base System X-Bugzilla-Component: kern X-Bugzilla-Version: 10.3-RELEASE X-Bugzilla-Keywords: patch X-Bugzilla-Severity: Affects Only Me X-Bugzilla-Who: linimon@FreeBSD.org X-Bugzilla-Status: New X-Bugzilla-Resolution: X-Bugzilla-Priority: --- X-Bugzilla-Assigned-To: freebsd-fs@FreeBSD.org X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: keywords assigned_to Message-ID: In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Bugzilla-URL: https://bugs.freebsd.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 20 May 2016 08:07:29 -0000 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=3D209508 Mark Linimon changed: What |Removed |Added ---------------------------------------------------------------------------- Keywords| |patch Assignee|freebsd-bugs@FreeBSD.org |freebsd-fs@FreeBSD.org --=20 You are receiving this mail because: 
You are the assignee for the bug.= From owner-freebsd-fs@freebsd.org Fri May 20 08:12:28 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 76367B426CB for ; Fri, 20 May 2016 08:12:28 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from kenobi.freebsd.org (kenobi.freebsd.org [IPv6:2001:1900:2254:206a::16:76]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 66F8F1CBB for ; Fri, 20 May 2016 08:12:28 +0000 (UTC) (envelope-from bugzilla-noreply@freebsd.org) Received: from bugs.freebsd.org ([127.0.1.118]) by kenobi.freebsd.org (8.15.2/8.15.2) with ESMTP id u4K8CS1R046912 for ; Fri, 20 May 2016 08:12:28 GMT (envelope-from bugzilla-noreply@freebsd.org) From: bugzilla-noreply@freebsd.org To: freebsd-fs@FreeBSD.org Subject: [Bug 209396] ZFS primarycache attribute affects secondary cache as well Date: Fri, 20 May 2016 08:12:28 +0000 X-Bugzilla-Reason: AssignedTo X-Bugzilla-Type: changed X-Bugzilla-Watch-Reason: None X-Bugzilla-Product: Base System X-Bugzilla-Component: kern X-Bugzilla-Version: 10.3-RELEASE X-Bugzilla-Keywords: X-Bugzilla-Severity: Affects Only Me X-Bugzilla-Who: linimon@FreeBSD.org X-Bugzilla-Status: New X-Bugzilla-Resolution: X-Bugzilla-Priority: --- X-Bugzilla-Assigned-To: freebsd-fs@FreeBSD.org X-Bugzilla-Flags: X-Bugzilla-Changed-Fields: assigned_to Message-ID: In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Bugzilla-URL: https://bugs.freebsd.org/bugzilla/ Auto-Submitted: auto-generated MIME-Version: 1.0 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 20 May 2016 08:12:28 -0000 https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=3D209396 Mark Linimon changed: What |Removed |Added ---------------------------------------------------------------------------- Assignee|freebsd-bugs@FreeBSD.org |freebsd-fs@FreeBSD.org --=20 You are receiving this mail because: You are the assignee for the bug.= From owner-freebsd-fs@freebsd.org Fri May 20 09:23:56 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 9476CB42511 for ; Fri, 20 May 2016 09:23:56 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from mailman.ysv.freebsd.org (mailman.ysv.freebsd.org [IPv6:2001:1900:2254:206a::50:5]) by mx1.freebsd.org (Postfix) with ESMTP id 7BA6110CB for ; Fri, 20 May 2016 09:23:56 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: by mailman.ysv.freebsd.org (Postfix) id 7AF17B42510; Fri, 20 May 2016 09:23:56 +0000 (UTC) Delivered-To: fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 7A8A1B4250E for ; Fri, 20 May 2016 09:23:56 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id 0C0EE10C9 for ; Fri, 20 May 2016 09:23:55 +0000 (UTC) (envelope-from 
kostikbel@gmail.com) Received: from tom.home (kib@localhost [127.0.0.1]) by kib.kiev.ua (8.15.2/8.15.2) with ESMTPS id u4K9Nnir092713 (version=TLSv1 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO); Fri, 20 May 2016 12:23:49 +0300 (EEST) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.10.3 kib.kiev.ua u4K9Nnir092713 Received: (from kostik@localhost) by tom.home (8.15.2/8.15.2/Submit) id u4K9NmVK092712; Fri, 20 May 2016 12:23:48 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Fri, 20 May 2016 12:23:48 +0300 From: Konstantin Belousov To: Bruce Evans Cc: fs@freebsd.org Subject: Re: fix for per-mount i/o counting in ffs Message-ID: <20160520092348.GV89104@kib.kiev.ua> References: <20160518061040.D5948@besplex.bde.org> <20160518070252.F6121@besplex.bde.org> <20160517220055.GF89104@kib.kiev.ua> <20160518084931.T6534@besplex.bde.org> <20160518110834.GJ89104@kib.kiev.ua> <20160519065714.H1393@besplex.bde.org> <20160519094901.O1798@besplex.bde.org> <20160519120557.A2250@besplex.bde.org> <20160519104128.GN89104@kib.kiev.ua> <20160520074427.W1151@besplex.bde.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160520074427.W1151@besplex.bde.org> User-Agent: Mutt/1.6.1 (2016-04-27) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no autolearn_force=no version=3.4.1 X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on tom.home X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 20 May 2016 09:23:56 -0000 On Fri, May 20, 2016 at 09:27:38AM +1000, Bruce Evans wrote: > On Thu, 19 May 2016, Konstantin Belousov wrote: > > > On Thu, May 19, 2016 at 12:20:19PM +1000, Bruce Evans wrote: > >> On Thu, 19 May 2016, Bruce Evans wrote: > >> > >>> On Thu, 19 May 2016, Bruce Evans wrote: > >>> > >>>> ... > >>>> I think the following works to prevent multiple mounts via all of the > >>>> known buggy paths: early in every fsmount(): > >> > >> Here is a lightly tested version: > > > > There is no need to protect the si_mountpt with any locking, the field > > itself serves as a lock good enough, also preventing the parallel mounts > > of the same devices. I changed the assignement to atomic_cmpset, which > > is enough there. It is somewhat pity that this would reliably disable > > multiple ro mounts of the same volume. > > I used a mutex since it is simpler. > > I think your version needs atomic ops for resetting the pointer, and > maybe acquire/release too. It has locking that is very similar to a > mutex. Mutexes use _mtx_obtain_lock = atomic_cmpset_acq_ptr and > _mtx_release_lock = atomic_store_rel_ptr. This is already delicately > weak -- full sequential consistency is not required. Then on x86, > we (you) only recently finished optimizing atomic_store_rel so that > it is as weak as possible (just a compiler membar before an ordinary > store). > > Maybe even weaker locking is enough here, but this is too hard to > understand. Well, I do not think that barriers would add much there, since we really do not care about two almost parallel mounts to fail, and other locking provides enough synchronization points. On the other hand, having explicit barriers makes si_mountpt act as the real semaphore. 
Unlike mutex, it attributes the ownership of the device to a mount point, and not to the locking thread. So I added acq/rel. > > > There is no need to move assignment of NULL to dev->si_mountpt later > > in ffs_unmount(), the moment where the assignment is performed is safe > > for other thread to start another mount. > > I already noticed that it was almost as late as possible (could be moved > 1 statement later) and not worth moving. > > But to even reason about orders, you need atomic releases with acquire/ > release semantics. There are dev_ref() and dev_rel() calls nearby. The > implementation of these probably has to and in fact does give some ordering. > The details are too hard to understand. In ffs_unmount() I think it is > actually ordering given by vrele() that makes things work: > > > PICKUP_GIANT(); > > - if (ump->um_devvp->v_type == VCHR && ump->um_devvp->v_rdev != NULL) > > - ump->um_devvp->v_rdev->si_mountpt = NULL; > > + ump->um_dev->si_mountpt = NULL; > > vrele(ump->um_devvp); > > dev_rel(ump->um_dev); > > We want the store to si_mountpt to become visible before the vnode is > unlocked. Otherwise, a new mount can lock the vnode and fail with > EBUSY because it sees si_mountpt != NULL. We have to know implementation > details of vrele() to know that this happens. Yes, we do not care about this window. For this to happen, mount request must be issued before the unmount request returned, and the mount caller is not able to prove that his mount attempt was started before the unmount progressed enough. > > > I still want to keep devvp locked for long enough to cover the bufobj > > hacking, and I do not want to move bufobj.bo_ops change before > > g_vfs_open() succeed. > > I didn't move it before g_vfs_open(), but before VOP_UNLOCK(). I think > v_bufobj is never cleared, but garbage in it is harmless except in the > multiple-mounts case which is now disallowed. So did I. > > > I also wanted to remove GIANT dances, but this requires geom patch, > > which I will mail separately. > > OK. I saw the other mail. > > > diff --git a/sys/ufs/ffs/ffs_vfsops.c b/sys/ufs/ffs/ffs_vfsops.c > > index 712fc21..21425f5 100644 > > --- a/sys/ufs/ffs/ffs_vfsops.c > > +++ b/sys/ufs/ffs/ffs_vfsops.c > > @@ -764,24 +764,29 @@ ffs_mountfs(devvp, mp, td) > > cred = td ? td->td_ucred : NOCRED; > > ronly = (mp->mnt_flag & MNT_RDONLY) != 0; > > > > + KASSERT(devvp->v_type == VCHR, ("reclaimed devvp")); > > dev = devvp->v_rdev; > > dev_ref(dev); > > + if (!atomic_cmpset_ptr(&dev->si_mountpt, 0, mp)) { > > + dev_rel(dev); > > + VOP_UNLOCK(devvp, 0); > > + return (EBUSY); > > + } > > This is cleaner and safer than my version. > > > DROP_GIANT(); > > g_topology_lock(); > > error = g_vfs_open(devvp, &cp, "ffs", ronly ? 0 : 1); > > g_vfs_open() already sets devvp->v_bufobj.bo_ops to g_vfs_bufops unless > it fails. This clobbered our setting in the buggy multiple-mount case. > But with multiple mounts not allowed, this cleans up any garbage in > v_bufobj. Yes, and this orders things. g_vfs_open() shoudl have devvp locked, both fo bo manipulations and for vnode_create_vobject() call. We can only assign to bo_ops after g_vfs_open() was done successfully. > > g_vfs_open() has 2 failures for non-exclusive access. It starts by > checking v_bufobj.bo_private == devvp (this is after translating its > pointers to the ones passed here). This is avg's fix for the multiple- > mounts problem (r206130). It doesn't work in all cases. I think this > is unecessary now. At least it weeds out other devfs mounts. 
> > Later, g_vfs_open() does a g_access() check for exclusive-enough access. > This is supposed to allow multiple mounts at least when all are ro. I > thought that avg modified this, but he actually did something different. > I think this check only failed in buggy cases where multiple mounts were > allowed. Our changes should make it never fail. It still returns the > wrong errno (some general one return by g_access() instead of the one > documented for mount() -- this is EBUSY). > > > g_topology_unlock(); > > PICKUP_GIANT(); > > - VOP_UNLOCK(devvp, 0); > > I don't like moving this below devvp accesses. It locks devvp, not dev. > > > - if (error) > > + if (error != 0) { > > + VOP_UNLOCK(devvp, 0); > > goto out; > > - if (devvp->v_rdev->si_iosize_max != 0) > > - mp->mnt_iosize_max = devvp->v_rdev->si_iosize_max; > > + } > > + if (dev->si_iosize_max != 0) > > + mp->mnt_iosize_max = dev->si_iosize_max; > > dev->si_iosize_max is locked by its undocumented lifetime. It is invariant > since some previous time. > > > if (mp->mnt_iosize_max > MAXPHYS) > > mp->mnt_iosize_max = MAXPHYS; > > - > > devvp->v_bufobj.bo_ops = &ffs_ops; > > This needs to be before the vnode unlock of course. > > I don't like the complication to avoid setting this if we g_vfs_open_fails, > but this at least makes it obvious that we don't set it to garbage when > g_vfs_open_fails. In other error cases, and even after unmount, I think > v_bufobj is left as garbage. > > I now see another cleanup: don't goto out when g_vfs_open() fails. This > depends on it setting cp to NULL and leaving nothing to clean when it > fails. It has no man page and this detail is documented in its source > code. Then I would need to add another NULL assignment, VOP_UNLOCK etc. > > > - if (devvp->v_type == VCHR) > > - devvp->v_rdev->si_mountpt = mp; > > + VOP_UNLOCK(devvp, 0); > > > > fs = NULL; > > sblockloc = 0; > > @@ -1083,8 +1088,6 @@ ffs_mountfs(devvp, mp, td) > > out: > > if (bp) > > brelse(bp); > > - if (devvp->v_type == VCHR && devvp->v_rdev != NULL) > > - devvp->v_rdev->si_mountpt = NULL; > > if (cp != NULL) { > > DROP_GIANT(); > > g_topology_lock(); > > @@ -1102,6 +1105,7 @@ out: > > free(ump, M_UFSMNT); > > mp->mnt_data = NULL; > > } > > + dev->si_mountpt = NULL; > > This should remain before the vnode unlock. Otherwise a new mount can > fail unnecessarily. See above. > > > dev_rel(dev); > > return (error); > > } > > @@ -1287,8 +1291,7 @@ ffs_unmount(mp, mntflags) > > g_vfs_close(ump->um_cp); > > g_topology_unlock(); > > PICKUP_GIANT(); > > - if (ump->um_devvp->v_type == VCHR && ump->um_devvp->v_rdev != NULL) > > - ump->um_devvp->v_rdev->si_mountpt = NULL; > > + ump->um_dev->si_mountpt = NULL; > > vrele(ump->um_devvp); > > dev_rel(ump->um_dev); > > This order is better for avoiding unnecessary failure for new mounts, but > now I'm not sure if it is right. Anyway, it matters less to get an > unnecessary failures for a new mount after a long-lived old mount than > after a failed mount, so the cleanup shouldn't be stricter than here. > > > mtx_destroy(UFS_MTX(ump)); > > Updated patch to add acq/rel. diff --git a/sys/ufs/ffs/ffs_vfsops.c b/sys/ufs/ffs/ffs_vfsops.c index 712fc21..670bb15 100644 --- a/sys/ufs/ffs/ffs_vfsops.c +++ b/sys/ufs/ffs/ffs_vfsops.c @@ -764,24 +764,29 @@ ffs_mountfs(devvp, mp, td) cred = td ? 
td->td_ucred : NOCRED; ronly = (mp->mnt_flag & MNT_RDONLY) != 0; + KASSERT(devvp->v_type == VCHR, ("reclaimed devvp")); dev = devvp->v_rdev; dev_ref(dev); + if (atomic_cmpset_acq_ptr(&dev->si_mountpt, 0, mp) != 0) { + dev_rel(dev); + VOP_UNLOCK(devvp, 0); + return (EBUSY); + } DROP_GIANT(); g_topology_lock(); error = g_vfs_open(devvp, &cp, "ffs", ronly ? 0 : 1); g_topology_unlock(); PICKUP_GIANT(); - VOP_UNLOCK(devvp, 0); - if (error) + if (error != 0) { + VOP_UNLOCK(devvp, 0); goto out; - if (devvp->v_rdev->si_iosize_max != 0) - mp->mnt_iosize_max = devvp->v_rdev->si_iosize_max; + } + if (dev->si_iosize_max != 0) + mp->mnt_iosize_max = dev->si_iosize_max; if (mp->mnt_iosize_max > MAXPHYS) mp->mnt_iosize_max = MAXPHYS; - devvp->v_bufobj.bo_ops = &ffs_ops; - if (devvp->v_type == VCHR) - devvp->v_rdev->si_mountpt = mp; + VOP_UNLOCK(devvp, 0); fs = NULL; sblockloc = 0; @@ -1083,8 +1088,6 @@ ffs_mountfs(devvp, mp, td) out: if (bp) brelse(bp); - if (devvp->v_type == VCHR && devvp->v_rdev != NULL) - devvp->v_rdev->si_mountpt = NULL; if (cp != NULL) { DROP_GIANT(); g_topology_lock(); @@ -1102,6 +1105,7 @@ out: free(ump, M_UFSMNT); mp->mnt_data = NULL; } + atomic_store_rel_ptr(&dev->si_mountpt, 0); dev_rel(dev); return (error); } @@ -1287,8 +1291,7 @@ ffs_unmount(mp, mntflags) g_vfs_close(ump->um_cp); g_topology_unlock(); PICKUP_GIANT(); - if (ump->um_devvp->v_type == VCHR && ump->um_devvp->v_rdev != NULL) - ump->um_devvp->v_rdev->si_mountpt = NULL; + atomic_store_rel_ptr(&ump->um_dev->si_mountpt, 0); vrele(ump->um_devvp); dev_rel(ump->um_dev); mtx_destroy(UFS_MTX(ump)); From owner-freebsd-fs@freebsd.org Fri May 20 11:22:11 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id D416BB43A55 for ; Fri, 20 May 2016 11:22:11 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mailman.ysv.freebsd.org (unknown [127.0.1.3]) by mx1.freebsd.org (Postfix) with ESMTP id BEFE21144 for ; Fri, 20 May 2016 11:22:11 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: by mailman.ysv.freebsd.org (Postfix) id BE28AB43A53; Fri, 20 May 2016 11:22:11 +0000 (UTC) Delivered-To: fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id BDC33B43A52 for ; Fri, 20 May 2016 11:22:11 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail105.syd.optusnet.com.au (mail105.syd.optusnet.com.au [211.29.132.249]) by mx1.freebsd.org (Postfix) with ESMTP id 6F68A1142 for ; Fri, 20 May 2016 11:22:11 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from besplex.bde.org (c110-21-42-169.carlnfd1.nsw.optusnet.com.au [110.21.42.169]) by mail105.syd.optusnet.com.au (Postfix) with ESMTPS id 036AE104A09F; Fri, 20 May 2016 21:22:08 +1000 (AEST) Date: Fri, 20 May 2016 21:22:08 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Konstantin Belousov cc: fs@freebsd.org Subject: Re: fix for per-mount i/o counting in ffs In-Reply-To: <20160520092348.GV89104@kib.kiev.ua> Message-ID: <20160520194427.W1170@besplex.bde.org> References: <20160518061040.D5948@besplex.bde.org> <20160518070252.F6121@besplex.bde.org> <20160517220055.GF89104@kib.kiev.ua> <20160518084931.T6534@besplex.bde.org> <20160518110834.GJ89104@kib.kiev.ua> <20160519065714.H1393@besplex.bde.org> <20160519094901.O1798@besplex.bde.org> <20160519120557.A2250@besplex.bde.org> 
<20160519104128.GN89104@kib.kiev.ua> <20160520074427.W1151@besplex.bde.org> <20160520092348.GV89104@kib.kiev.ua> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.1 cv=TuMb/2jh c=1 sm=1 tr=0 a=kDyANCGC9fy361NNEb9EQQ==:117 a=kDyANCGC9fy361NNEb9EQQ==:17 a=L9H7d07YOLsA:10 a=9cW_t1CCXrUA:10 a=s5jvgZ67dGcA:10 a=kj9zAlcOel0A:10 a=oUMX8HkAuKKT6YiDj2EA:9 a=Yzrsocd7YFqDQDc-:21 a=wDFIkm-ngTmmU6Y6:21 a=CjuIK1q_8ugA:10 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 20 May 2016 11:22:11 -0000

On Fri, 20 May 2016, Konstantin Belousov wrote:

> On Fri, May 20, 2016 at 09:27:38AM +1000, Bruce Evans wrote:
>> On Thu, 19 May 2016, Konstantin Belousov wrote:
>>
>>> On Thu, May 19, 2016 at 12:20:19PM +1000, Bruce Evans wrote:
>>>> On Thu, 19 May 2016, Bruce Evans wrote:
>>>>
>>>>> On Thu, 19 May 2016, Bruce Evans wrote:
>>>>>> ...
>>>>>> I think the following works to prevent multiple mounts via all of the
>>>>>> known buggy paths: early in every fsmount():
>>>>
>>>> Here is a lightly tested version:

I checked some details for r206130 again:
- r206130 claims to allow only 1 mount per device node.  It actually
  allows only 1 mount per vnode.
- the case of separate vnodes seems to actually work almost as intended.
  This is most easily reached using separate devfs mounts.  It gets the
  access counts right (they are combined).  It has a chance of working
  because the separate vnodes provide a place to attach separate bufobjs.
- the common si_mountpt of course can't work for multiple mounts, but
  clobbering it doesn't have to break anything more than the i/o counts.
- I think it was only intended to allow multiple ro mounts.  However,
  1 rw mount is allowed after any number of ro mounts (using separate
  vnodes after r206130).  ro after rw is also allowed, but then the ro
  mount prints the warning that the fs was not properly dismounted.
- I think the behaviour in the previous point is a side effect of
  allowing fsck to write on a device opened ro for a ro mount.  fsck
  reloads for just one of the ro mounts.

So multiple mounts are still too dangerous, and we should finish
r206130.

>> I think your version needs atomic ops for resetting the pointer, and
>> maybe acquire/release too.  It has locking that is very similar to a
>> ...
>> Maybe even weaker locking is enough here, but this is too hard to
>> understand.

> Well, I do not think that barriers would add much there, since we really
> do not care about two almost parallel mounts to fail, and other locking
> provides enough synchronization points.  On the other hand, having
> explicit barriers makes si_mountpt act as the real semaphore.  Unlike
> mutex, it attributes the ownership of the device to a mount point, and
> not to the locking thread.
>
> So I added acq/rel.

Thanks.

>>> ...
>>> diff --git a/sys/ufs/ffs/ffs_vfsops.c b/sys/ufs/ffs/ffs_vfsops.c
>>> index 712fc21..21425f5 100644
>>> --- a/sys/ufs/ffs/ffs_vfsops.c
>>> +++ b/sys/ufs/ffs/ffs_vfsops.c
>>> @@ -764,24 +764,29 @@ ffs_mountfs(devvp, mp, td)
>>> 	cred = td ? td->td_ucred : NOCRED;
>>> 	ronly = (mp->mnt_flag & MNT_RDONLY) != 0;
>>>
>>> +	KASSERT(devvp->v_type == VCHR, ("reclaimed devvp"));
>>> 	dev = devvp->v_rdev;
>>> 	dev_ref(dev);
>>> +	if (!atomic_cmpset_ptr(&dev->si_mountpt, 0, mp)) {
>>> +		dev_rel(dev);
>>> +		VOP_UNLOCK(devvp, 0);
>>> +		return (EBUSY);
>>> +	}
>>
>> This is cleaner and safer than my version.
>>
>>> 	DROP_GIANT();
>>> 	g_topology_lock();
>>> 	error = g_vfs_open(devvp, &cp, "ffs", ronly ? 0 : 1);
>>
>> g_vfs_open() already sets devvp->v_bufobj.bo_ops to g_vfs_bufops unless
>> it fails.  This clobbered our setting in the buggy multiple-mount case.
>> But with multiple mounts not allowed, this cleans up any garbage in
>> v_bufobj.

> Yes, and this orders things.  g_vfs_open() should have devvp locked,
> both for bo manipulations and for the vnode_create_vobject() call.
> We can only assign to bo_ops after g_vfs_open() was done successfully.

The atomic cmpset now orders things too.  Is that enough?  It ensures
that an old mount cannot be active.  I don't know if v_bufobj is used
for non-mounts.

Except, for zfs there is no g_vfs_open() to order things, and for all
other file systems there is no atomic cmpset yet.

>> g_vfs_open() has 2 failures for non-exclusive access.  It starts by
>> checking v_bufobj.bo_private == devvp (this is after translating its
>> pointers to the ones passed here).  This is avg's fix for the multiple-
>> mounts problem (r206130).  It doesn't work in all cases.  I think this
>> is unnecessary now.

> At least it weeds out other devfs mounts.

Yes, we need it until everything is converted.

>> ...
>> I now see another cleanup: don't goto out when g_vfs_open() fails.  This
>> depends on it setting cp to NULL and leaving nothing to clean when it
>> fails.  It has no man page and this detail is documented in its source
>> code.

> Then I would need to add another NULL assignment, VOP_UNLOCK etc.

g_vfs_open() already sets cp to NULL when it fails, and the cleanup
depends on that now, but it is just as good to depend on no cleanup
being needed on failure.  You do need another dev_rel().

I thought about moving the dev_ref() later to simplify the early returns.
I thought that this didn't quite work.  Now I think it does work, for
obvious reasons:
- the device is attached to a vnode, so it is referenced to prevent it
  going away unless the device is revoked.  It seems to be referenced
  at least 3 times in FreeBSD-9.
- the vnode is locked, so the reference count remains > 0 until we unlock.

So we just need a dev_ref() before the unlock in the non-error case, to
keep the device from going away if it is revoked.

> Updated patch to add acq/rel.
>
> diff --git a/sys/ufs/ffs/ffs_vfsops.c b/sys/ufs/ffs/ffs_vfsops.c
> index 712fc21..670bb15 100644
> --- a/sys/ufs/ffs/ffs_vfsops.c
> +++ b/sys/ufs/ffs/ffs_vfsops.c
> @@ -764,24 +764,29 @@ ffs_mountfs(devvp, mp, td)
> 	cred = td ? td->td_ucred : NOCRED;
> 	ronly = (mp->mnt_flag & MNT_RDONLY) != 0;
>
> +	KASSERT(devvp->v_type == VCHR, ("reclaimed devvp"));

Hrmph.

> 	dev = devvp->v_rdev;
> 	dev_ref(dev);

Move later...

> +	if (atomic_cmpset_acq_ptr(&dev->si_mountpt, 0, mp) != 0) {

I changed the first 0 to NULL, and this works on i386, but now I remember
that i386 has bogus casts which break detection of type mismatches --
the atomic ptr functions take a [u]intptr_t, not a pointer type, so
NULL won't work if it is ((void *)0).  At least amd64 is still missing
this bug.

> +		dev_rel(dev);

...then this dev_rel() is not needed.
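The cast pitfall described above, reduced to a compilable sketch
(standard C, not i386's atomic.h; take_uintptr is an illustrative
stand-in for the atomic ptr functions):

	#include <stdint.h>
	#include <stdio.h>

	/* Stand-in with the same parameter convention as the atomic
	 * ptr operations: it takes a uintptr_t, not a pointer type. */
	static int
	take_uintptr(uintptr_t old)
	{
		return (old == 0);
	}

	int
	main(void)
	{
		/*
		 * take_uintptr(0) compiles everywhere: 0 is an integer.
		 *
		 * take_uintptr(NULL) is a constraint violation when
		 * NULL is defined as ((void *)0), because C has no
		 * implicit conversion from a pointer type to an
		 * integer type.  An implementation that casts its
		 * arguments to an integer type internally (as i386's
		 * atomic_cmpset_ptr() did) hides exactly this kind of
		 * type mismatch.
		 */
		printf("%d\n", take_uintptr(0));
		return (0);
	}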
> + VOP_UNLOCK(devvp, 0); > + return (EBUSY); > + } > DROP_GIANT(); > g_topology_lock(); > error = g_vfs_open(devvp, &cp, "ffs", ronly ? 0 : 1); > g_topology_unlock(); > PICKUP_GIANT(); > - VOP_UNLOCK(devvp, 0); > - if (error) > + if (error != 0) { > + VOP_UNLOCK(devvp, 0); > goto out; This becomes: if (error != 0) { VOP_UNLOCK(devvp, 0); return (EBUSY); } Then assign v_bufobj. Then dev_ref(), just in time for unlocking. Then unlock. > - if (devvp->v_rdev->si_iosize_max != 0) > - mp->mnt_iosize_max = devvp->v_rdev->si_iosize_max; > + } > + if (dev->si_iosize_max != 0) > + mp->mnt_iosize_max = dev->si_iosize_max; > if (mp->mnt_iosize_max > MAXPHYS) > mp->mnt_iosize_max = MAXPHYS; > - > devvp->v_bufobj.bo_ops = &ffs_ops; > - if (devvp->v_type == VCHR) > - devvp->v_rdev->si_mountpt = mp; > + VOP_UNLOCK(devvp, 0); This belongs earlier. > > fs = NULL; > sblockloc = 0; > ... We need this in a central function. g_vfs_open/close() can do it for all cases except zfs. This looks like: DROP_GIANT(); g_topology_lock(); // atomic_cmpset and its error = EBUSY moved to top of g_vfs_open() error = g_vfs_open(devvp, &cp, "ffs", ronly ? 0 : 1); g_topology_unlock(); PICKUP_GIANT(); if (error != 0) { VOP_UNLOCK(devvp, 0); return (error); } devvp->v_bufobj.bo_ops = &ffs_ops; dev_ref(dev); VOP_UNLOCK(devvp, 0); if (dev->si_iosize_max != 0) mp->mnt_iosize_max = dev->si_iosize_max; if (mp->mnt_iosize_max > MAXPHYS) mp->mnt_iosize_max = MAXPHYS; where 2 of 2 lines with GIANT and 3 of 4 lines with iosize_max remain to be cleaned up. Resetting si_mountpt in g_vfs_close() is even simpler. Oops, it also has to be reset in g_vfs_open() on a later failure there. Bruce From owner-freebsd-fs@freebsd.org Fri May 20 11:33:32 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 21112B43CC8 for ; Fri, 20 May 2016 11:33:32 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mailman.ysv.freebsd.org (unknown [127.0.1.3]) by mx1.freebsd.org (Postfix) with ESMTP id 0C7FD16F9 for ; Fri, 20 May 2016 11:33:32 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: by mailman.ysv.freebsd.org (Postfix) id 0BC46B43CC6; Fri, 20 May 2016 11:33:32 +0000 (UTC) Delivered-To: fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 0B67CB43CC5 for ; Fri, 20 May 2016 11:33:32 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from mail106.syd.optusnet.com.au (mail106.syd.optusnet.com.au [211.29.132.42]) by mx1.freebsd.org (Postfix) with ESMTP id C85E516F8 for ; Fri, 20 May 2016 11:33:31 +0000 (UTC) (envelope-from brde@optusnet.com.au) Received: from besplex.bde.org (c110-21-42-169.carlnfd1.nsw.optusnet.com.au [110.21.42.169]) by mail106.syd.optusnet.com.au (Postfix) with ESMTPS id EC3843C76BA; Fri, 20 May 2016 21:33:23 +1000 (AEST) Date: Fri, 20 May 2016 21:33:22 +1000 (EST) From: Bruce Evans X-X-Sender: bde@besplex.bde.org To: Bruce Evans cc: Konstantin Belousov , fs@freebsd.org Subject: Re: fix for per-mount i/o counting in ffs In-Reply-To: <20160520194427.W1170@besplex.bde.org> Message-ID: <20160520212839.E1436@besplex.bde.org> References: <20160518061040.D5948@besplex.bde.org> <20160518070252.F6121@besplex.bde.org> <20160517220055.GF89104@kib.kiev.ua> <20160518084931.T6534@besplex.bde.org> <20160518110834.GJ89104@kib.kiev.ua> <20160519065714.H1393@besplex.bde.org> 
<20160519094901.O1798@besplex.bde.org> <20160519120557.A2250@besplex.bde.org> <20160519104128.GN89104@kib.kiev.ua> <20160520074427.W1151@besplex.bde.org> <20160520092348.GV89104@kib.kiev.ua> <20160520194427.W1170@besplex.bde.org> MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII; format=flowed X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.1 cv=EfU1O6SC c=1 sm=1 tr=0 a=kDyANCGC9fy361NNEb9EQQ==:117 a=kDyANCGC9fy361NNEb9EQQ==:17 a=L9H7d07YOLsA:10 a=9cW_t1CCXrUA:10 a=s5jvgZ67dGcA:10 a=kj9zAlcOel0A:10 a=mxpNFrNJh-qTCabnfI8A:9 a=CjuIK1q_8ugA:10 X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 20 May 2016 11:33:32 -0000 PS (sigh): On Fri, 20 May 2016, Bruce Evans wrote: > On Fri, 20 May 2016, Konstantin Belousov wrote: > >> On Fri, May 20, 2016 at 09:27:38AM +1000, Bruce Evans wrote: >>> ... >>> I now see another cleanup: don't goto out when g_vfs_open() fails. This >>> depends on it setting cp to NULL and leaving nothing to clean when it >>> fails. It has no man page and this detail is documented in its source >>> code. >> Then I would need to add another NULL assignment, VOP_UNLOCK etc. > > g_vfs_open() already sets cp to NULL when it fails, and the cleanup > depends on that now, but it is just as good to depend on no cleanup > being needed on failure. You do need another dev_rel(). Oops, you mean another NULL assignment (atomic op) for cleaning up si_mountpt. I got that right at the end where I moved things to g_vfs_open(). Bruce From owner-freebsd-fs@freebsd.org Fri May 20 14:27:04 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 57278B426E4 for ; Fri, 20 May 2016 14:27:04 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from mailman.ysv.freebsd.org (mailman.ysv.freebsd.org [IPv6:2001:1900:2254:206a::50:5]) by mx1.freebsd.org (Postfix) with ESMTP id 3E68F1F34 for ; Fri, 20 May 2016 14:27:04 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: by mailman.ysv.freebsd.org (Postfix) id 3A065B426E2; Fri, 20 May 2016 14:27:04 +0000 (UTC) Delivered-To: fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id 39A6AB426E0 for ; Fri, 20 May 2016 14:27:04 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not present a certificate) by mx1.freebsd.org (Postfix) with ESMTPS id BE0F61F33 for ; Fri, 20 May 2016 14:27:03 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from tom.home (kib@localhost [127.0.0.1]) by kib.kiev.ua (8.15.2/8.15.2) with ESMTPS id u4KEQt3e006945 (version=TLSv1 cipher=DHE-RSA-CAMELLIA256-SHA bits=256 verify=NO); Fri, 20 May 2016 17:26:55 +0300 (EEST) (envelope-from kostikbel@gmail.com) DKIM-Filter: OpenDKIM Filter v2.10.3 kib.kiev.ua u4KEQt3e006945 Received: (from kostik@localhost) by tom.home (8.15.2/8.15.2/Submit) id u4KEQsxK006944; Fri, 20 May 2016 17:26:54 +0300 (EEST) (envelope-from kostikbel@gmail.com) X-Authentication-Warning: tom.home: kostik set sender to kostikbel@gmail.com using -f Date: Fri, 20 May 2016 17:26:54 +0300 From: Konstantin Belousov To: Bruce Evans Cc: fs@freebsd.org Subject: Re: fix for 
per-mount i/o counting in ffs Message-ID: <20160520142654.GW89104@kib.kiev.ua> References: <20160517220055.GF89104@kib.kiev.ua> <20160518084931.T6534@besplex.bde.org> <20160518110834.GJ89104@kib.kiev.ua> <20160519065714.H1393@besplex.bde.org> <20160519094901.O1798@besplex.bde.org> <20160519120557.A2250@besplex.bde.org> <20160519104128.GN89104@kib.kiev.ua> <20160520074427.W1151@besplex.bde.org> <20160520092348.GV89104@kib.kiev.ua> <20160520194427.W1170@besplex.bde.org> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20160520194427.W1170@besplex.bde.org> User-Agent: Mutt/1.6.1 (2016-04-27) X-Spam-Status: No, score=-2.0 required=5.0 tests=ALL_TRUSTED,BAYES_00, DKIM_ADSP_CUSTOM_MED,FREEMAIL_FROM,NML_ADSP_CUSTOM_MED autolearn=no autolearn_force=no version=3.4.1 X-Spam-Checker-Version: SpamAssassin 3.4.1 (2015-04-28) on tom.home X-BeenThere: freebsd-fs@freebsd.org X-Mailman-Version: 2.1.22 Precedence: list List-Id: Filesystems List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Fri, 20 May 2016 14:27:04 -0000 On Fri, May 20, 2016 at 09:22:08PM +1000, Bruce Evans wrote: > On Fri, 20 May 2016, Konstantin Belousov wrote: > > > On Fri, May 20, 2016 at 09:27:38AM +1000, Bruce Evans wrote: > >>> diff --git a/sys/ufs/ffs/ffs_vfsops.c b/sys/ufs/ffs/ffs_vfsops.c > >>> index 712fc21..21425f5 100644 > >>> --- a/sys/ufs/ffs/ffs_vfsops.c > >>> +++ b/sys/ufs/ffs/ffs_vfsops.c > >>> @@ -764,24 +764,29 @@ ffs_mountfs(devvp, mp, td) > >>> cred = td ? td->td_ucred : NOCRED; > >>> ronly = (mp->mnt_flag & MNT_RDONLY) != 0; > >>> > >>> + KASSERT(devvp->v_type == VCHR, ("reclaimed devvp")); > >>> dev = devvp->v_rdev; > >>> dev_ref(dev); > >>> + if (!atomic_cmpset_ptr(&dev->si_mountpt, 0, mp)) { > >>> + dev_rel(dev); > >>> + VOP_UNLOCK(devvp, 0); > >>> + return (EBUSY); > >>> + } > >> > >> This is cleaner and safer than my version. > >> > >>> DROP_GIANT(); > >>> g_topology_lock(); > >>> error = g_vfs_open(devvp, &cp, "ffs", ronly ? 0 : 1); > >> > >> g_vfs_open() already sets devvp->v_bufobj.bo_ops to g_vfs_bufops unless > >> it fails. This clobbered our setting in the buggy multiple-mount case. > >> But with multiple mounts not allowed, this cleans up any garbage in > >> v_bufobj. > > Yes, and this orders things. g_vfs_open() shoudl have devvp locked, > > both fo bo manipulations and for vnode_create_vobject() call. > > We can only assign to bo_ops after g_vfs_open() was done successfully. > > The atomic cmpset now orders things too. Is that enough? It ensures > that an old mount cannot be active. I don't know if v_bufobj is used > for non-mounts. v_bufobj is logically protected against modifications by the vnode lock. > > Except, for zfs there is no g_vfs_open() to order things, and for all > other file systems there is no atomic cmpset yet. > > >> g_vfs_open() has 2 failures for non-exclusive access. It starts by > >> checking v_bufobj.bo_private == devvp (this is after translating its > >> pointers to the ones passed here). This is avg's fix for the multiple- > >> mounts problem (r206130). It doesn't work in all cases. I think this > >> is unecessary now. > > At least it weeds out other devfs mounts. > > Yes, we need it until everything is converted. > > >> ... > >> I now see another cleanup: don't goto out when g_vfs_open() fails. This > >> depends on it setting cp to NULL and leaving nothing to clean when it > >> fails. It has no man page and this detail is documented in its source > >> code. 
> > Then I would need to add another NULL assignment, VOP_UNLOCK etc. > > g_vfs_open() already sets cp to NULL when it fails, and the cleanup > depends on that now, but it is just as good to depend on no cleanup > being needed on failure. You do need another dev_rel(). > > I thought about moving the dev_ref() later to simplify the early returns. > I thought that this didn't quite work. Now I think it does work, for > obvious reasons: > - the device is attached to a vnode, so it is referenced to prevent it > going away unless the device is revoked. It seems to be referenced > at least 3 times in FreeBSD-9. > - the vnode is locked, so the reference count remains > 0 until we unlock. > So we just need a dev_ref() before the unlock in the non-error case, to > keep the device from going away if it is revoked. Yes, and this is how the current patched code is structured. > > > Updated patch to add acq/rel. > > > > diff --git a/sys/ufs/ffs/ffs_vfsops.c b/sys/ufs/ffs/ffs_vfsops.c > > index 712fc21..670bb15 100644 > > --- a/sys/ufs/ffs/ffs_vfsops.c > > +++ b/sys/ufs/ffs/ffs_vfsops.c > > @@ -764,24 +764,29 @@ ffs_mountfs(devvp, mp, td) > > cred = td ? td->td_ucred : NOCRED; > > ronly = (mp->mnt_flag & MNT_RDONLY) != 0; > > > > + KASSERT(devvp->v_type == VCHR, ("reclaimed devvp")); > > Hrmph. I want this, it would remove amount of obvious questions. > > > dev = devvp->v_rdev; > > dev_ref(dev); > > Move later... > > > + if (atomic_cmpset_acq_ptr(&dev->si_mountpt, 0, mp) != 0) { > > I changed the first 0 to NULL, and this works on i386, but now I remember > that i386 has bogus casts which break detection of type mismatches -- > the atomic ptr functions take a [u]intptr_t, not a pointer type, so > NULL won't work if it is ((void *)0). At least amd64 is still missing > this bug. cmpset__ptr() on i386 has cast for old and new parameters to u_int. store_rel_ptr() on i386 does not cast value to u_int. As result, NULL is acceptable for cmpset, but not for store. I spelled it 0 in all cases. Hm, I also should add uintptr_t cast for cmpset, otherwise, I suspect, some arch might be broken. > > > + dev_rel(dev); > > ...then this dev_rel() is not needed. > > > + VOP_UNLOCK(devvp, 0); > > + return (EBUSY); > > + } > > DROP_GIANT(); > > g_topology_lock(); > > error = g_vfs_open(devvp, &cp, "ffs", ronly ? 0 : 1); > > g_topology_unlock(); > > PICKUP_GIANT(); > > - VOP_UNLOCK(devvp, 0); > > - if (error) > > + if (error != 0) { > > + VOP_UNLOCK(devvp, 0); > > goto out; > > This becomes: > > if (error != 0) { > VOP_UNLOCK(devvp, 0); > return (EBUSY); > } > > Then assign v_bufobj. > > Then dev_ref(), just in time for unlocking. > > Then unlock. Ok. > > > - if (devvp->v_rdev->si_iosize_max != 0) > > - mp->mnt_iosize_max = devvp->v_rdev->si_iosize_max; > > + } > > + if (dev->si_iosize_max != 0) > > + mp->mnt_iosize_max = dev->si_iosize_max; > > if (mp->mnt_iosize_max > MAXPHYS) > > mp->mnt_iosize_max = MAXPHYS; > > - > > devvp->v_bufobj.bo_ops = &ffs_ops; > > - if (devvp->v_type == VCHR) > > - devvp->v_rdev->si_mountpt = mp; > > + VOP_UNLOCK(devvp, 0); > > This belongs earlier. > > > > > fs = NULL; > > sblockloc = 0; > > ... > > We need this in a central function. g_vfs_open/close() can do it for > all cases except zfs. This looks like: I might look at this later. > > DROP_GIANT(); > g_topology_lock(); > // atomic_cmpset and its error = EBUSY moved to top of g_vfs_open() > error = g_vfs_open(devvp, &cp, "ffs", ronly ? 
0 : 1); > g_topology_unlock(); > PICKUP_GIANT(); > if (error != 0) { > VOP_UNLOCK(devvp, 0); > return (error); > } > devvp->v_bufobj.bo_ops = &ffs_ops; > dev_ref(dev); > VOP_UNLOCK(devvp, 0); > if (dev->si_iosize_max != 0) > mp->mnt_iosize_max = dev->si_iosize_max; > if (mp->mnt_iosize_max > MAXPHYS) > mp->mnt_iosize_max = MAXPHYS; > > where 2 of 2 lines with GIANT and 3 of 4 lines with iosize_max remain to > be cleaned up. > > Resetting si_mountpt in g_vfs_close() is even simpler. Oops, it also has > to be reset in g_vfs_open() on a later failure there. diff --git a/sys/ufs/ffs/ffs_vfsops.c b/sys/ufs/ffs/ffs_vfsops.c index 712fc21..65b1891 100644 --- a/sys/ufs/ffs/ffs_vfsops.c +++ b/sys/ufs/ffs/ffs_vfsops.c @@ -764,25 +764,30 @@ ffs_mountfs(devvp, mp, td) cred = td ? td->td_ucred : NOCRED; ronly = (mp->mnt_flag & MNT_RDONLY) != 0; + KASSERT(devvp->v_type == VCHR, ("reclaimed devvp")); dev = devvp->v_rdev; - dev_ref(dev); + if (atomic_cmpset_acq_ptr(&dev->si_mountpt, 0, (uintptr_t)mp) != 0) { + VOP_UNLOCK(devvp, 0); + return (EBUSY); + } DROP_GIANT(); g_topology_lock(); error = g_vfs_open(devvp, &cp, "ffs", ronly ? 0 : 1); g_topology_unlock(); PICKUP_GIANT(); + if (error != 0) { + VOP_UNLOCK(devvp, 0); + atomic_store_rel_ptr(&dev->si_mountpt, 0); + return (error); + } + dev_ref(dev); + devvp->v_bufobj.bo_ops = &ffs_ops; VOP_UNLOCK(devvp, 0); - if (error) - goto out; - if (devvp->v_rdev->si_iosize_max != 0) - mp->mnt_iosize_max = devvp->v_rdev->si_iosize_max; + if (dev->si_iosize_max != 0) + mp->mnt_iosize_max = dev->si_iosize_max; if (mp->mnt_iosize_max > MAXPHYS) mp->mnt_iosize_max = MAXPHYS; - devvp->v_bufobj.bo_ops = &ffs_ops; - if (devvp->v_type == VCHR) - devvp->v_rdev->si_mountpt = mp; - fs = NULL; sblockloc = 0; /* @@ -1083,8 +1088,6 @@ ffs_mountfs(devvp, mp, td) out: if (bp) brelse(bp); - if (devvp->v_type == VCHR && devvp->v_rdev != NULL) - devvp->v_rdev->si_mountpt = NULL; if (cp != NULL) { DROP_GIANT(); g_topology_lock(); @@ -1102,6 +1105,7 @@ out: free(ump, M_UFSMNT); mp->mnt_data = NULL; } + atomic_store_rel_ptr(&dev->si_mountpt, 0); dev_rel(dev); return (error); } @@ -1287,8 +1291,7 @@ ffs_unmount(mp, mntflags) g_vfs_close(ump->um_cp); g_topology_unlock(); PICKUP_GIANT(); - if (ump->um_devvp->v_type == VCHR && ump->um_devvp->v_rdev != NULL) - ump->um_devvp->v_rdev->si_mountpt = NULL; + atomic_store_rel_ptr(&ump->um_dev->si_mountpt, 0); vrele(ump->um_devvp); dev_rel(ump->um_dev); mtx_destroy(UFS_MTX(ump)); From owner-freebsd-fs@freebsd.org Fri May 20 15:36:56 2016 Return-Path: Delivered-To: freebsd-fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id C0498B43D92 for ; Fri, 20 May 2016 15:36:56 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from mailman.ysv.freebsd.org (mailman.ysv.freebsd.org [IPv6:2001:1900:2254:206a::50:5]) by mx1.freebsd.org (Postfix) with ESMTP id A76CA1062 for ; Fri, 20 May 2016 15:36:56 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: by mailman.ysv.freebsd.org (Postfix) id A67A4B43D90; Fri, 20 May 2016 15:36:56 +0000 (UTC) Delivered-To: fs@mailman.ysv.freebsd.org Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2001:1900:2254:206a::19:1]) by mailman.ysv.freebsd.org (Postfix) with ESMTP id A621FB43D8F for ; Fri, 20 May 2016 15:36:56 +0000 (UTC) (envelope-from kostikbel@gmail.com) Received: from kib.kiev.ua (kib.kiev.ua [IPv6:2001:470:d5e7:1::1]) (using TLSv1 with cipher DHE-RSA-CAMELLIA256-SHA (256/256 bits)) (Client did not 
From owner-freebsd-fs@freebsd.org Fri May 20 15:36:56 2016
Date: Fri, 20 May 2016 18:36:49 +0300
From: Konstantin Belousov
To: Bruce Evans
Cc: fs@freebsd.org
Subject: Re: fix for per-mount i/o counting in ffs
Message-ID: <20160520153649.GX89104@kib.kiev.ua>
In-Reply-To: <20160520142654.GW89104@kib.kiev.ua>

On Fri, May 20, 2016 at 05:26:54PM +0300, Konstantin Belousov wrote:
> On Fri, May 20, 2016 at 09:22:08PM +1000, Bruce Evans wrote:
> > On Fri, 20 May 2016, Konstantin Belousov wrote:
> > > On Fri, May 20, 2016 at 09:27:38AM +1000, Bruce Evans wrote:
> > >>> diff --git a/sys/ufs/ffs/ffs_vfsops.c b/sys/ufs/ffs/ffs_vfsops.c
> > >>> index 712fc21..21425f5 100644
> > >>> --- a/sys/ufs/ffs/ffs_vfsops.c
> > >>> +++ b/sys/ufs/ffs/ffs_vfsops.c
> > >>> @@ -764,24 +764,29 @@ ffs_mountfs(devvp, mp, td)
> > >>> 	cred = td ? td->td_ucred : NOCRED;
> > >>> 	ronly = (mp->mnt_flag & MNT_RDONLY) != 0;
> > >>>
> > >>> +	KASSERT(devvp->v_type == VCHR, ("reclaimed devvp"));
> > >>> 	dev = devvp->v_rdev;
> > >>> 	dev_ref(dev);
> > >>> +	if (!atomic_cmpset_ptr(&dev->si_mountpt, 0, mp)) {
> > >>> +		dev_rel(dev);
> > >>> +		VOP_UNLOCK(devvp, 0);
> > >>> +		return (EBUSY);
> > >>> +	}
> > >>
> > >> This is cleaner and safer than my version.
> > >>
> > >>> 	DROP_GIANT();
> > >>> 	g_topology_lock();
> > >>> 	error = g_vfs_open(devvp, &cp, "ffs", ronly ? 0 : 1);
> > >>
> > >> g_vfs_open() already sets devvp->v_bufobj.bo_ops to g_vfs_bufops unless
> > >> it fails.  This clobbered our setting in the buggy multiple-mount case.
> > >> But with multiple mounts not allowed, this cleans up any garbage in
> > >> v_bufobj.
> > > Yes, and this orders things.  g_vfs_open() should have devvp locked,
> > > both for bo manipulations and for the vnode_create_vobject() call.
> > > We can only assign to bo_ops after g_vfs_open() was done successfully.
> >
> > The atomic cmpset now orders things too.  Is that enough?  It ensures
> > that an old mount cannot be active.  I don't know if v_bufobj is used
> > for non-mounts.
> v_bufobj is logically protected against modifications by the vnode lock.
> >
> > Except, for zfs there is no g_vfs_open() to order things, and for all
> > other file systems there is no atomic cmpset yet.
> >
> > >> g_vfs_open() has 2 failures for non-exclusive access.  It starts by
> > >> checking v_bufobj.bo_private == devvp (this is after translating its
> > >> pointers to the ones passed here).  This is avg's fix for the
> > >> multiple-mounts problem (r206130).  It doesn't work in all cases.  I
> > >> think this is unnecessary now.
> > > At least it weeds out other devfs mounts.
> >
> > Yes, we need it until everything is converted.
> >
> > >> ...
> > >> I now see another cleanup: don't goto out when g_vfs_open() fails.  This
> > >> depends on it setting cp to NULL and leaving nothing to clean when it
> > >> fails.  It has no man page and this detail is documented in its source
> > >> code.
> > > Then I would need to add another NULL assignment, VOP_UNLOCK etc.
> >
> > g_vfs_open() already sets cp to NULL when it fails, and the cleanup
> > depends on that now, but it is just as good to depend on no cleanup
> > being needed on failure.  You do need another dev_rel().
> >
> > I thought about moving the dev_ref() later to simplify the early returns.
> > I thought that this didn't quite work.  Now I think it does work, for
> > obvious reasons:
> > - the device is attached to a vnode, so it is referenced to prevent it
> >   going away unless the device is revoked.  It seems to be referenced
> >   at least 3 times in FreeBSD-9.
> > - the vnode is locked, so the reference count remains > 0 until we unlock.
> > So we just need a dev_ref() before the unlock in the non-error case, to
> > keep the device from going away if it is revoked.
> Yes, and this is how the current patched code is structured.
> >
> > > Updated patch to add acq/rel.
> > >
> > > diff --git a/sys/ufs/ffs/ffs_vfsops.c b/sys/ufs/ffs/ffs_vfsops.c
> > > index 712fc21..670bb15 100644
> > > --- a/sys/ufs/ffs/ffs_vfsops.c
> > > +++ b/sys/ufs/ffs/ffs_vfsops.c
> > > @@ -764,24 +764,29 @@ ffs_mountfs(devvp, mp, td)
> > > 	cred = td ? td->td_ucred : NOCRED;
> > > 	ronly = (mp->mnt_flag & MNT_RDONLY) != 0;
> > >
> > > +	KASSERT(devvp->v_type == VCHR, ("reclaimed devvp"));
> >
> > Hrmph.
> I want this, it would remove a number of obvious questions.
> >
> > > 	dev = devvp->v_rdev;
> > > 	dev_ref(dev);
> >
> > Move later...
> >
> > > +	if (atomic_cmpset_acq_ptr(&dev->si_mountpt, 0, mp) == 0) {
> >
> > I changed the first 0 to NULL, and this works on i386, but now I remember
> > that i386 has bogus casts which break detection of type mismatches --
> > the atomic ptr functions take a [u]intptr_t, not a pointer type, so
> > NULL won't work if it is ((void *)0).  At least amd64 is still missing
> > this bug.
> cmpset_*_ptr() on i386 has casts for the old and new parameters to u_int.
> store_rel_ptr() on i386 does not cast the value to u_int.  As a result,
> NULL is acceptable for cmpset, but not for store.  I spelled it 0 in all
> cases.
>
> Hm, I also should add a uintptr_t cast for cmpset; otherwise, I suspect,
> some arch might be broken.
Even more casts are needed; the updated patch is below.
> >
> > > +		dev_rel(dev);
> >
> > ...then this dev_rel() is not needed.
> >
> > > +		VOP_UNLOCK(devvp, 0);
> > > +		return (EBUSY);
> > > +	}
> > > 	DROP_GIANT();
> > > 	g_topology_lock();
> > > 	error = g_vfs_open(devvp, &cp, "ffs", ronly ? 0 : 1);
> > > 	g_topology_unlock();
> > > 	PICKUP_GIANT();
> > > -	VOP_UNLOCK(devvp, 0);
> > > -	if (error)
> > > +	if (error != 0) {
> > > +		VOP_UNLOCK(devvp, 0);
> > > 		goto out;
> >
> > This becomes:
> >
> > 	if (error != 0) {
> > 		VOP_UNLOCK(devvp, 0);
> > 		return (EBUSY);
> > 	}
> >
> > Then assign v_bufobj.
> >
> > Then dev_ref(), just in time for unlocking.
> >
> > Then unlock.
> Ok.
> >
> > > -	if (devvp->v_rdev->si_iosize_max != 0)
> > > -		mp->mnt_iosize_max = devvp->v_rdev->si_iosize_max;
> > > +	}
> > > +	if (dev->si_iosize_max != 0)
> > > +		mp->mnt_iosize_max = dev->si_iosize_max;
> > > 	if (mp->mnt_iosize_max > MAXPHYS)
> > > 		mp->mnt_iosize_max = MAXPHYS;
> > > -
> > > 	devvp->v_bufobj.bo_ops = &ffs_ops;
> > > -	if (devvp->v_type == VCHR)
> > > -		devvp->v_rdev->si_mountpt = mp;
> > > +	VOP_UNLOCK(devvp, 0);
> >
> > This belongs earlier.
> >
> > >
> > > 	fs = NULL;
> > > 	sblockloc = 0;
> > > ...
> >
> > We need this in a central function.  g_vfs_open/close() can do it for
> > all cases except zfs.  This looks like:
> I might look at this later.
> >
> > 	DROP_GIANT();
> > 	g_topology_lock();
> > 	// atomic_cmpset and its error = EBUSY moved to top of g_vfs_open()
> > 	error = g_vfs_open(devvp, &cp, "ffs", ronly ? 0 : 1);
> > 	g_topology_unlock();
> > 	PICKUP_GIANT();
> > 	if (error != 0) {
> > 		VOP_UNLOCK(devvp, 0);
> > 		return (error);
> > 	}
> > 	devvp->v_bufobj.bo_ops = &ffs_ops;
> > 	dev_ref(dev);
> > 	VOP_UNLOCK(devvp, 0);
> > 	if (dev->si_iosize_max != 0)
> > 		mp->mnt_iosize_max = dev->si_iosize_max;
> > 	if (mp->mnt_iosize_max > MAXPHYS)
> > 		mp->mnt_iosize_max = MAXPHYS;
> >
> > where 2 of 2 lines with GIANT and 3 of 4 lines with iosize_max remain to
> > be cleaned up.
> >
> > Resetting si_mountpt in g_vfs_close() is even simpler.  Oops, it also has
> > to be reset in g_vfs_open() on a later failure there.

diff --git a/sys/ufs/ffs/ffs_vfsops.c b/sys/ufs/ffs/ffs_vfsops.c
index 712fc21..0487c2f 100644
--- a/sys/ufs/ffs/ffs_vfsops.c
+++ b/sys/ufs/ffs/ffs_vfsops.c
@@ -764,25 +764,31 @@ ffs_mountfs(devvp, mp, td)
 	cred = td ? td->td_ucred : NOCRED;
 	ronly = (mp->mnt_flag & MNT_RDONLY) != 0;
 
+	KASSERT(devvp->v_type == VCHR, ("reclaimed devvp"));
 	dev = devvp->v_rdev;
-	dev_ref(dev);
+	if (atomic_cmpset_acq_ptr((uintptr_t *)&dev->si_mountpt, 0,
+	    (uintptr_t)mp) == 0) {
+		VOP_UNLOCK(devvp, 0);
+		return (EBUSY);
+	}
 	DROP_GIANT();
 	g_topology_lock();
 	error = g_vfs_open(devvp, &cp, "ffs", ronly ? 0 : 1);
 	g_topology_unlock();
 	PICKUP_GIANT();
+	if (error != 0) {
+		VOP_UNLOCK(devvp, 0);
+		atomic_store_rel_ptr((uintptr_t *)&dev->si_mountpt, 0);
+		return (error);
+	}
+	dev_ref(dev);
+	devvp->v_bufobj.bo_ops = &ffs_ops;
 	VOP_UNLOCK(devvp, 0);
-	if (error)
-		goto out;
-	if (devvp->v_rdev->si_iosize_max != 0)
-		mp->mnt_iosize_max = devvp->v_rdev->si_iosize_max;
+	if (dev->si_iosize_max != 0)
+		mp->mnt_iosize_max = dev->si_iosize_max;
 	if (mp->mnt_iosize_max > MAXPHYS)
 		mp->mnt_iosize_max = MAXPHYS;
 
-	devvp->v_bufobj.bo_ops = &ffs_ops;
-	if (devvp->v_type == VCHR)
-		devvp->v_rdev->si_mountpt = mp;
-
 	fs = NULL;
 	sblockloc = 0;
 	/*
@@ -1083,8 +1089,6 @@ ffs_mountfs(devvp, mp, td)
 out:
 	if (bp)
 		brelse(bp);
-	if (devvp->v_type == VCHR && devvp->v_rdev != NULL)
-		devvp->v_rdev->si_mountpt = NULL;
 	if (cp != NULL) {
 		DROP_GIANT();
 		g_topology_lock();
@@ -1102,6 +1106,7 @@ out:
 		free(ump, M_UFSMNT);
 		mp->mnt_data = NULL;
 	}
+	atomic_store_rel_ptr((uintptr_t *)&dev->si_mountpt, 0);
 	dev_rel(dev);
 	return (error);
 }
@@ -1287,8 +1292,7 @@ ffs_unmount(mp, mntflags)
 	g_vfs_close(ump->um_cp);
 	g_topology_unlock();
 	PICKUP_GIANT();
-	if (ump->um_devvp->v_type == VCHR && ump->um_devvp->v_rdev != NULL)
-		ump->um_devvp->v_rdev->si_mountpt = NULL;
+	atomic_store_rel_ptr((uintptr_t *)&ump->um_dev->si_mountpt, 0);
 	vrele(ump->um_devvp);
 	dev_rel(ump->um_dev);
 	mtx_destroy(UFS_MTX(ump));
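[As a standalone illustration of why the (uintptr_t) casts in the patch
above are safe, and of the NULL-versus-0 asymmetry discussed earlier:
C guarantees (C11 7.20.1.4) that a valid object pointer converted to
uintptr_t and back compares equal to the original.  The sketch below is
plain C with illustrative types, not kernel code.]

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

struct mount { int dummy; };		/* illustrative stand-in */

int
main(void)
{
	struct mount m;
	struct mount *mp = &m;
	uintptr_t u;
	struct mount *back;

	u = (uintptr_t)mp;		/* pointer -> integer */
	back = (struct mount *)u;	/* integer -> pointer */
	assert(back == mp);		/* the round trip is exact */

	/*
	 * The asymmetry noted above: the literal 0 converts cleanly to a
	 * uintptr_t parameter, while NULL may expand to ((void *)0), and
	 * a pointer does not implicitly convert to an integer type, so
	 * passing NULL where uintptr_t is expected can fail to compile.
	 */
	uintptr_t zero = 0;

	printf("round trip ok; zero = %ju\n", (uintmax_t)zero);
	return (0);
}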
From owner-freebsd-fs@freebsd.org Sat May 21 01:48:23 2016
Date: Sat, 21 May 2016 11:48:20 +1000 (EST)
From: Bruce Evans
To: Konstantin Belousov
cc: Bruce Evans, fs@freebsd.org
Subject: Re: fix for per-mount i/o counting in ffs
In-Reply-To: <20160520153649.GX89104@kib.kiev.ua>
Message-ID: <20160521111424.T1652@besplex.bde.org>

On Fri, 20 May 2016, Konstantin Belousov wrote:

[This is the version with uintptr_t casts]

> On Fri, May 20, 2016 at 05:26:54PM +0300, Konstantin Belousov wrote:
>> On Fri, May 20, 2016 at 09:22:08PM +1000, Bruce Evans wrote:
>>> On Fri, 20 May 2016, Konstantin Belousov wrote:
>>>> On Fri, May 20, 2016 at 09:27:38AM +1000, Bruce Evans wrote:
>>>>>> diff --git a/sys/ufs/ffs/ffs_vfsops.c b/sys/ufs/ffs/ffs_vfsops.c
>>>>>> index 712fc21..21425f5 100644
>>>>>> --- a/sys/ufs/ffs/ffs_vfsops.c
>>>>>> +++ b/sys/ufs/ffs/ffs_vfsops.c
>>>>>> @@ -764,24 +764,29 @@ ffs_mountfs(devvp, mp, td)
>>>>>> 	cred = td ? td->td_ucred : NOCRED;
>>>>>> 	ronly = (mp->mnt_flag & MNT_RDONLY) != 0;
>>>>>>
>>>>>> +	KASSERT(devvp->v_type == VCHR, ("reclaimed devvp"));
>>>>>> 	dev = devvp->v_rdev;
>>>>>> 	dev_ref(dev);
>>>>>> +	if (!atomic_cmpset_ptr(&dev->si_mountpt, 0, mp)) {
>>>>>> +		dev_rel(dev);
>>>>>> +		VOP_UNLOCK(devvp, 0);
>>>>>> +		return (EBUSY);
>>>>>> +	}
>[*]
>>> The atomic cmpset now orders things too.  Is that enough?  It ensures
>>> that an old mount cannot be active.  I don't know if v_bufobj is used
>>> for non-mounts.
>> v_bufobj is logically protected against modifications by the vnode lock.

I meant "is it enough if we drop the vnode lock earlier".  Also, what
protects from later accesses and modifications?

>>>> diff --git a/sys/ufs/ffs/ffs_vfsops.c b/sys/ufs/ffs/ffs_vfsops.c
>>>> index 712fc21..670bb15 100644
>>>> --- a/sys/ufs/ffs/ffs_vfsops.c
>>>> +++ b/sys/ufs/ffs/ffs_vfsops.c
>>>> @@ -764,24 +764,29 @@ ffs_mountfs(devvp, mp, td)
>>>> 	cred = td ? td->td_ucred : NOCRED;
>>>> 	ronly = (mp->mnt_flag & MNT_RDONLY) != 0;
>>>>
>>>> +	KASSERT(devvp->v_type == VCHR, ("reclaimed devvp"));
>>>
>>> Hrmph.
>> I want this, it would remove a number of obvious questions.

But it gives negatively useful runtime checking...

>>>> 	dev = devvp->v_rdev;
>>>> 	dev_ref(dev);

...this gives better runtime checking.  It gives a nice restartable null
pointer trap except in the INVARIANTS case the KASSERT() gives a
non-restartable panic.

>> cmpset_*_ptr() on i386 has casts for the old and new parameters to u_int.
>> store_rel_ptr() on i386 does not cast the value to u_int.  As a result,
>> NULL is acceptable for cmpset, but not for store.  I spelled it 0 in all
>> cases.
>>
>> Hm, I also should add a uintptr_t cast for cmpset; otherwise, I suspect,
>> some arch might be broken.
> Even more casts are needed; the updated patch is below.

These are necessary, unfortunately.  Perhaps 32-bit arches need the bogus
casts more because of the difference in pointer sizes.  The caller might
have a long variable and expect this to work the same as an int variable
on 32-bit arches because the sizes are the same, but compilers should
detect this type mismatch (for pointers to these types).  On 64-bit
arches, callers must be more careful and use only long variables.

There are also signedness problems.  Plain int and plain long shouldn't
work since only unsigned variables are supported.  Compilers should
detect this too, but traditionally they were sloppier about this, and we
apparently don't usually enable the warning flag for this.

> diff --git a/sys/ufs/ffs/ffs_vfsops.c b/sys/ufs/ffs/ffs_vfsops.c
> index 712fc21..0487c2f 100644
> --- a/sys/ufs/ffs/ffs_vfsops.c
> +++ b/sys/ufs/ffs/ffs_vfsops.c
> @@ -764,25 +764,31 @@ ffs_mountfs(devvp, mp, td)
> 	cred = td ? td->td_ucred : NOCRED;
> 	ronly = (mp->mnt_flag & MNT_RDONLY) != 0;
>
> +	KASSERT(devvp->v_type == VCHR, ("reclaimed devvp"));

Hrmph.

> 	dev = devvp->v_rdev;
> -	dev_ref(dev);
> +	if (atomic_cmpset_acq_ptr((uintptr_t *)&dev->si_mountpt, 0,
> +	    (uintptr_t)mp) == 0) {
> +		VOP_UNLOCK(devvp, 0);
> +		return (EBUSY);
> +	}
> 	DROP_GIANT();
> 	g_topology_lock();
> 	error = g_vfs_open(devvp, &cp, "ffs", ronly ? 0 : 1);
> 	g_topology_unlock();
> 	PICKUP_GIANT();
> +	if (error != 0) {
> +		VOP_UNLOCK(devvp, 0);
> +		atomic_store_rel_ptr((uintptr_t *)&dev->si_mountpt, 0);

The store must be before the unlock (since we don't hold a reference to
the dev yet, the dev may go away on revoke after unlock).  Otherwise OK.
> +		return (error);
> +	}
> +	dev_ref(dev);

Bruce
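[To close the loop on that last point: it is a lifetime rule as much as a
memory-ordering one.  Once the vnode lock is dropped with no device
reference held, the cdev may be revoked and freed, so the release store
must land first.  Below is a standalone sketch of the rule, modeling the
vnode lock with a pthread mutex; the names and types are illustrative,
not kernel code.]

#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

struct dev {
	_Atomic(uintptr_t) mountpt;
};

struct vnode {
	pthread_mutex_t lock;	/* while held, *dev cannot be destroyed */
	struct dev *dev;
};

static void
mount_error_path(struct vnode *vp)
{
	pthread_mutex_lock(&vp->lock);
	/* ... the mount claimed vp->dev->mountpt, then hit an error ... */

	/* Correct order: drop the claim while the lock still pins *dev. */
	atomic_store_explicit(&vp->dev->mountpt, 0, memory_order_release);
	pthread_mutex_unlock(&vp->lock);
	/*
	 * After the unlock, vp->dev may be revoked and freed at any time;
	 * storing into it here would be the use-after-free Bruce describes.
	 */
}

int
main(void)
{
	struct vnode vp;

	vp.dev = malloc(sizeof(*vp.dev));
	if (vp.dev == NULL)
		return (1);
	pthread_mutex_init(&vp.lock, NULL);
	atomic_init(&vp.dev->mountpt, (uintptr_t)&vp);	/* pretend claimed */
	mount_error_path(&vp);
	pthread_mutex_destroy(&vp.lock);
	free(vp.dev);
	return (0);
}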