From owner-freebsd-scsi@freebsd.org Mon Jan 16 02:40:20 2017
From: Aijaz Baig <aijazbaig1@gmail.com>
Date: Mon, 16 Jan 2017 08:10:16 +0530
Subject: Understanding the rationale behind dropping of "block devices"
To: FreeBSD Hackers, freebsd-scsi@freebsd.org

I am a relative noob to the storage world in general and FreeBSD in
particular. From what I have been learning of late, I have become
somewhat familiar with concepts like disk queuing, IOPS, latencies and
the like. I am also reading the classic 'The Design and Implementation
of the FreeBSD Operating System'. From what I am reading in that book
and elsewhere, it appears that FreeBSD has done away with "block
devices" altogether: they have been dropped from the architecture. With
my (still) very limited knowledge of storage, I understand this to mean
that there are no drivers in FreeBSD that deal with blocks of data.
But when I check the disk nodes under /dev I get this:

    ls -l /dev/*disk0
    brw-r-----  1 root  operator   14,   0 Jan  2 09:39 /dev/disk0
    crw-r-----  1 root  operator   14,   0 Jan  2 09:39 /dev/rdisk0

where 'b' means block interface and 'c' means char (raw) interface. How
do I reconcile this with what I read about block devices being gone?
What does 'block' mean here? From what I know, a block device would be
served through the "page cache" (a place where the file system caches
its data and metadata), whereas the raw device would be served via the
"buffer cache", where disk blocks are cached by the OS. Thus a block
device would be served via the file system whereas the raw device would
not. Is this correct? If yes, then what does 'block' above signify? Or,
rephrasing the question: what was there in FreeBSD before 'block device
support' was dropped?

I am sure seasoned storage veterans have a lot more to add. I would be
much obliged if someone could elaborate and add more context.

--
Best Regards,
Aijaz Baig

From owner-freebsd-scsi@freebsd.org Mon Jan 16 07:20:33 2017
From: Greg 'groggy' Lehey <grog@lemis.com>
Date: Mon, 16 Jan 2017 18:11:05 +1100
Subject: Re: Understanding the rationale behind dropping of "block devices"
To: Aijaz Baig
Cc: FreeBSD Hackers, freebsd-scsi@freebsd.org

On Monday, 16 January 2017 at 8:10:16 +0530, Aijaz Baig wrote:
>
> But when I check the disk nodes under /dev I get this
>     ls -l /dev/*disk0
>     brw-r-----  1 root  operator   14,   0 Jan  2 09:39 /dev/disk0
>     crw-r-----  1 root  operator   14,   0 Jan  2 09:39 /dev/rdisk0

Are you sure that this is FreeBSD?  The naming convention looks more
like Mac OS, though the major device number doesn't match.  FreeBSD has
been through a number of disk naming conventions, but I'm pretty sure
that we never had anything as straightforward as 'disk'.

> what was there earlier in FreeBSD before 'block device support' was
> dropped?

Apart from the name, things used to look similar.
Here's a quote from "The Complete FreeBSD", written some time at the end
of the last century:

    crw-r-----  1 root  operator    3, 131072 Oct 31 19:59 /dev/rwd0s1a
    brw-r-----  1 root  operator    0, 131072 Oct 31 19:59 /dev/wd0s1a

The minor number included the partition encoding, thus the large number.

Greg
--
Sent from my desktop computer.
Finger grog@FreeBSD.org for PGP public key.
See complete headers for address and phone numbers.
This message is digitally signed.  If your Microsoft mail program
reports problems, please read http://lemis.com/broken-MUA
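A short aside on that minor number, as a hedged sketch: the decoding
below follows the old dkmakeminor() convention from 2.x-era FreeBSD
(partition in the low bits, unit above it, slice in bits 16 and up);
the exact bit positions are from memory and should be treated as an
assumption, not a reference.

    /*
     * Decode an old-style FreeBSD disk minor number.
     * Sketch only: the bit layout below is an assumption based on the
     * 2.x-era dkmakeminor() macro; verify against a real disklabel.h.
     */
    #include <stdio.h>

    int
    main(void)
    {
        unsigned minor = 131072;            /* from /dev/wd0s1a above */
        unsigned part  = minor & 0x7;       /* 0 -> partition 'a' */
        unsigned unit  = (minor >> 3) & 0x1f;
        unsigned slice = minor >> 16;       /* 2 -> "s1": slice numbers
                                               were offset by one, with
                                               slice 0 reserved for
                                               compatibility */

        printf("unit %u, slice field %u, partition '%c'\n",
            unit, slice, 'a' + part);       /* 0, 2, 'a' => wd0s1a */
        return 0;
    }

Under those assumptions, 131072 (0x20000) is just "slice field 2, unit
0, partition a", which matches the wd0s1a name in Greg's listing.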
From owner-freebsd-scsi@freebsd.org Mon Jan 16 08:49:21 2017
From: Aijaz Baig <aijazbaig1@gmail.com>
Date: Mon, 16 Jan 2017 14:19:18 +0530
Subject: Re: Understanding the rationale behind dropping of "block devices"
To: Greg 'groggy' Lehey
Cc: FreeBSD Hackers, freebsd-scsi@freebsd.org

Oh yes, I was actually running an old release inside a VM, and I had
changed the device names myself while jotting down notes (to give them
more descriptive names, like OS X does). I have now checked on a recent
release, and there is indeed no block device:

    root@bsd-client:/dev # gpart show
    =>      34  83886013  da0  GPT  (40G)
            34      1024    1  freebsd-boot  (512K)
          1058  58719232    2  freebsd-ufs  (28G)
      58720290   3145728    3  freebsd-swap  (1.5G)
      61866018  22020029       - free -  (10G)

    root@bsd-client:/dev # ls -lrt da*
    crw-r-----  1 root  operator  0x4d Dec 19 17:49 da0p1
    crw-r-----  1 root  operator  0x4b Dec 19 17:49 da0
    crw-r-----  1 root  operator  0x4f Dec 19 23:19 da0p3
    crw-r-----  1 root  operator  0x4e Dec 19 23:19 da0p2

So this shows that I have a single SATA or SAS drive with apparently
three partitions (or is it four? And why does it show unused space when
I had used the entire disk?).

Nevertheless my question still holds. What does 'removing support for
block devices' mean in this context? Was my earlier understanding
correct, viz. that all disk devices now have a character (raw) interface
and are no longer served via the "page cache" but rather the "buffer
cache"? Does that mean all disk accesses are now direct, bypassing the
file system?

--
Best Regards,
Aijaz Baig
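An easy way to confirm the char-only world from a program: a minimal
sketch using stat(2) (the device path /dev/da0 is an assumption; any
node under /dev works).

    /*
     * Check whether a device node is a character or a block device.
     * Minimal sketch; /dev/da0 is an assumed path, substitute your own.
     */
    #include <sys/stat.h>
    #include <stdio.h>

    int
    main(void)
    {
        struct stat sb;

        if (stat("/dev/da0", &sb) == -1) {
            perror("stat");
            return 1;
        }
        if (S_ISCHR(sb.st_mode))
            printf("character device\n");  /* what modern FreeBSD shows */
        else if (S_ISBLK(sb.st_mode))
            printf("block device\n");      /* no longer seen for disks */
        return 0;
    }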
From owner-freebsd-scsi@freebsd.org Mon Jan 16 09:20:41 2017
From: Julian Elischer <julian@freebsd.org>
Date: Mon, 16 Jan 2017 17:20:25 +0800
Subject: Re: Understanding the rationale behind dropping of "block devices"
To: Aijaz Baig, Greg 'groggy' Lehey
Cc: FreeBSD Hackers, freebsd-scsi@freebsd.org

On 16/01/2017 4:49 PM, Aijaz Baig wrote:
> [...]
> Nevertheless my question still holds. What does 'removing support for
> block devices' mean in this context? [...] Does that mean all disk
> accesses are now direct, bypassing the file system?

Basically, FreeBSD never really buffered/cached by device.

Buffering and caching is done by vnode in the filesystem.
We have no device-based block cache.  If you want file X at offset Y,
then we can satisfy that from cache.
VM objects map closely to vnode objects, so the VM system IS the file
system buffer cache.  If you want device M at offset N, we will fetch it
for you from the device, DMA'd directly into your address space, but
there is no cached copy.

Having said that, it would be trivial to add a 'caching' geom layer to
the system, but that has never been needed.

The added complexity of carrying around two alternate interfaces to the
same devices was judged, by those who did the work, to be not worth the
small gain available to the very few people who used raw devices.
Interestingly, since that time ZFS has implemented a block-layer cache
for itself, which is of course not integrated with the non-existent
block-level cache in the system :-).
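To make the uncached path concrete, here is a minimal, hedged sketch of
a direct read(2) from a disk's devfs node (the device path is an
assumption, root privileges are required, and transfers must be
sector-sized and sector-aligned):

    /*
     * Read one sector straight from a disk device node.  This goes
     * through physio: the user buffer is handed to the driver for DMA;
     * no block cache is involved.  /dev/da0 is an assumed device path.
     */
    #include <sys/types.h>
    #include <sys/disk.h>   /* DIOCGSECTORSIZE */
    #include <sys/ioctl.h>
    #include <err.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int
    main(void)
    {
        int fd = open("/dev/da0", O_RDONLY);
        if (fd == -1)
            err(1, "open");

        u_int secsize;
        if (ioctl(fd, DIOCGSECTORSIZE, &secsize) == -1)
            err(1, "DIOCGSECTORSIZE");

        void *buf = aligned_alloc(secsize, secsize);
        if (buf == NULL)
            err(1, "aligned_alloc");

        /* Disk devices accept only sector-granular transfers. */
        if (read(fd, buf, secsize) != (ssize_t)secsize)
            err(1, "read");

        printf("read one %u-byte sector, uncached\n", secsize);
        free(buf);
        close(fd);
        return 0;
    }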
From owner-freebsd-scsi@freebsd.org Mon Jan 16 09:31:21 2017
From: "Poul-Henning Kamp" <phk@phk.freebsd.dk>
Date: Mon, 16 Jan 2017 09:31:12 +0000
Subject: Re: Understanding the rationale behind dropping of "block devices"
To: Julian Elischer
Cc: Aijaz Baig, Greg 'groggy' Lehey, FreeBSD Hackers, freebsd-scsi@freebsd.org

--------
In message, Julian Elischer writes:

> Having said that, it would be trivial to add a 'caching' geom layer to
> the system but that has never been needed.

A tinker-toy cache like that would be architecturally disgusting.

The right solution would be to enable mmap(2)'ing of disk(-like)
devices, leveraging the VM system's existing code for caching and
optimistic prefetch/clustering, including the very primitive
cache-control/visibility offered by madvise(2), mincore(2), mprotect(2),
msync(2) etc.

--
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
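FreeBSD does not let you mmap(2) a disk device today; that is exactly
what is being proposed here.  As a hedged illustration of the interface
phk has in mind, the same pattern applied to a regular file (the path is
an assumption) looks like this:

    /*
     * Sketch of the mmap(2)-based access pattern proposed for disk
     * devices, demonstrated on a regular file.  Mapping a raw disk
     * this way is NOT currently supported; that is the point under
     * discussion.  /tmp/backing.img is an assumed, pre-existing file.
     */
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <err.h>
    #include <fcntl.h>
    #include <unistd.h>

    int
    main(void)
    {
        int fd = open("/tmp/backing.img", O_RDWR);
        if (fd == -1)
            err(1, "open");

        struct stat sb;
        if (fstat(fd, &sb) == -1)
            err(1, "fstat");

        char *p = mmap(NULL, sb.st_size, PROT_READ | PROT_WRITE,
            MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
            err(1, "mmap");

        madvise(p, sb.st_size, MADV_SEQUENTIAL); /* hint: prefetch ahead */
        p[0] ^= 0xff;                    /* touch a "block" via the VM cache */
        msync(p, sb.st_size, MS_SYNC);   /* write the dirty page back */

        munmap(p, sb.st_size);
        close(fd);
        return 0;
    }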
From owner-freebsd-scsi@freebsd.org Mon Jan 16 10:26:18 2017
From: Jan Bramkamp <crest@rlwinm.de>
Date: Mon, 16 Jan 2017 11:26:07 +0100
Subject: Re: Understanding the rationale behind dropping of "block devices"
To: freebsd-scsi@freebsd.org

On 16/01/2017 10:31, Poul-Henning Kamp wrote:
> The right solution would be to enable mmap(2)'ing of disk(-like)
> devices, leveraging the VM system's existing code for caching and
> optimistic prefetch/clustering, including the very primitive
> cache-control/visibility offered by madvise(2), mincore(2),
> mprotect(2), msync(2) etc.

Enabling mmap(2) on devices would be nice, but it would also create
problems with revoke(2).  The revoke(2) syscall allows revoking access
to open devices (e.g. a serial console); this is required to securely
log out users.  The existing file descriptors are marked as revoked and
will return EIO on every access.  How would you implement gracefully
revoking mapped device memory?  Killing all those processes with
SIGBUS/SIGSEGV would keep the system secure, but it would be far from
elegant.
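For readers who haven't met it: revoke(2) is a real FreeBSD syscall,
int revoke(const char *path).  A minimal sketch of the descriptor
revocation Jan describes (the tty path is an assumption and appropriate
privileges are required; the exact post-revoke behaviour is worth
verifying against revoke(2)):

    /*
     * Demonstrate revoke(2): after revocation, previously opened
     * descriptors for the device stop working.  /dev/ttyv1 is an
     * assumed device path.
     */
    #include <err.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
        const char *path = "/dev/ttyv1";
        int fd = open(path, O_RDWR | O_NONBLOCK);
        if (fd == -1)
            err(1, "open");

        if (revoke(path) == -1)     /* invalidate ALL open descriptors */
            err(1, "revoke");

        char c;
        ssize_t n = read(fd, &c, 1);
        if (n == -1)
            printf("read after revoke failed: %s\n", strerror(errno));
        else
            printf("read after revoke returned %zd\n", n);
        /* A descriptor can fail cleanly like this; a memory mapping has
         * no equally graceful failure mode, which is Jan's point. */
        close(fd);
        return 0;
    }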
From owner-freebsd-scsi@freebsd.org Mon Jan 16 10:39:07 2017
From: Aijaz Baig <aijazbaig1@gmail.com>
Date: Mon, 16 Jan 2017 16:09:03 +0530
Subject: Re: Understanding the rationale behind dropping of "block devices"
To: Jan Bramkamp
Cc: freebsd-scsi@freebsd.org

Oh, thank you everyone for clearing the air a bit. Although for a noob
like myself, that was mighty concise!

Nevertheless, let me re-iterate what was summarized in the last two
mails so I know I understood exactly what was being said.

Let me begin by saying that I come from the Linux world, where there
have traditionally been two separate caches, the "buffer cache" and the
"page cache", although almost all IO is now driven through the "page
cache".
The buffer cache still remains, however it now only caches disk blocks
(https://www.quora.com/What-is-the-difference-between-Buffers-and-Cached-columns-in-proc-meminfo-output).
So 'read' and 'write' were satisfied through the buffer cache, whereas
'fwrite/fread' and 'mmap' went through the page cache (which was
actually populated by reading the buffer cache, thereby wasting almost
twice the memory and compute cycles). Hence the merging.

Nevertheless, as Julian mentioned, it appears that there is no "buffer
cache" so to speak (is that correct, Julian?):

> If you want device M, at offset N we will fetch it for you from the
> device, DMA'd directly into your address space, but there is no cached
> copy.

Instead it appears FreeBSD has a generic 'VM object' that is used to
represent myriad entities including disks, and as such all operations
now go through the VM subsystem. Does that also mean that there is no
way an application can directly use raw disks? At least it appears so:

> The added complexity of carrying around two alternate interfaces to
> the same devices was judged by those who did the work to be not worth
> the small gain available to the very few people who used raw devices.

Thank you for all your inputs, and waiting to hear more! Although a bit
more context would really help noobs (both to enterprise storage and
FreeBSD) like me!
--
Best Regards,
Aijaz Baig

From owner-freebsd-scsi@freebsd.org Mon Jan 16 10:50:02 2017
From: Aijaz Baig <aijazbaig1@gmail.com>
Date: Mon, 16 Jan 2017 16:19:59 +0530
Subject: Re: Understanding the rationale behind dropping of "block devices"
To: Jan Bramkamp
Cc: freebsd-scsi@freebsd.org

I must add that I am getting confused specifically between two different
things here:

From the replies above it appears that all disk accesses have to go
through the VM subsystem now (so no raw disk accesses), yet the
architecture handbook says raw interfaces are the way to go for disks
(https://www.freebsd.org/doc/en/books/arch-handbook/driverbasics-block.html)?

Secondly, I presume that the VM subsystem has its own caching and
buffering mechanism that is independent of the file system, so an IO can
choose to skip buffering at the file-system layer but will still be
served by the VM cache, irrespective of whatever the VM object maps to.
Is that true? I believe this is what is meant by 'caching' at the VM
layer. Any comments?
--
Best Regards,
Aijaz Baig

From owner-freebsd-scsi@freebsd.org Mon Jan 16 11:00:16 2017
From: Konstantin Belousov <kostikbel@gmail.com>
Date: Mon, 16 Jan 2017 13:00:09 +0200
Subject: Re: Understanding the rationale behind dropping of "block devices"
To: Julian Elischer
Cc: Aijaz Baig, Greg 'groggy' Lehey, FreeBSD Hackers, freebsd-scsi@freebsd.org

On Mon, Jan 16, 2017 at 05:20:25PM +0800, Julian Elischer wrote:
> Basically, FreeBSD never really buffered/cached by device.
>
> Buffering and caching is done by vnode in the filesystem.
> We have no device-based block cache.  If you want file X at offset Y,
> then we can satisfy that from cache.
> VM objects map closely to vnode objects so the VM system IS the file
> system buffer cache.

This is not true.

We do have a buffer cache of the blocks read through the device
(special) vnode.  This is how, typically, the metadata for filesystems
which are clients of the buffer cache is handled, i.e. UFS, msdosfs,
cd9660 etc.  It is up to the filesystem to not create aliased cached
copies of the blocks, both in the device vnode buffer list and in the
filesystem vnode.

In fact, sometimes filesystems, e.g. UFS, consciously break this rule
and read blocks of the user vnode through the disk cache.  For instance,
this happens for the SU truncation of the indirect blocks.

> If you want device M, at offset N we will fetch it for you from the
> device, DMA'd directly into your address space, but there is no cached
> copy.
> Having said that, it would be trivial to add a 'caching' geom layer to
> the system but that has never been needed.

The useful interpretation of the claim that FreeBSD does not cache disk
blocks is that the cache is not accessible over the user-initiated i/o
(read(2) and write(2)) through the opened devfs nodes.  If a program
issues such a request, it indeed goes directly to/from the disk driver,
which is supplied a kernel buffer formed by remapped user pages.

Note that if this device was or is mounted and the filesystem kept some
metadata in the buffer cache, then the devfs i/o would make the cache
inconsistent.

> The added complexity of carrying around two alternate interfaces to
> the same devices was judged by those who did the work to be not worth
> the small gain available to the very few people who used raw devices.
> Interestingly, since that time ZFS has implemented a block-layer cache
> for itself which is of course not integrated with the non-existing
> block level cache in the system :-).

We do carry two interfaces in the cdev drivers, which are lumped into
one.  In particular, it is not easy to implement mapping of the block
devices exactly because the interfaces are mixed.  If a cdev disk device
is mapped, VM would try to use cdevsw d_mmap or later mapping interfaces
to handle user page faults, which is incorrect for the purpose of disk
block mapping.
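A small, hedged way to observe the physio path described above: disk
devfs nodes accept only sector-granular transfers, so a one-byte read(2)
fails while a sector-sized one succeeds.  The device path, the 512-byte
sector size, and the expected errno (EINVAL) are all assumptions to
verify on your own system.

    /*
     * Observe the raw (physio) i/o constraints on a disk devfs node:
     * transfers must be sector-granular.  /dev/da0 and 512-byte
     * sectors are assumptions.
     */
    #include <sys/types.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
        int fd = open("/dev/da0", O_RDONLY);
        if (fd == -1) {
            perror("open");
            return 1;
        }

        char onebyte;
        if (read(fd, &onebyte, 1) == -1)        /* sub-sector read */
            printf("1-byte read failed: %s\n", strerror(errno));

        void *sector = aligned_alloc(512, 512); /* assumed sector size */
        if (sector != NULL && pread(fd, sector, 512, 0) == 512)
            printf("sector-sized read went straight to the driver\n");

        free(sector);
        close(fd);
        return 0;
    }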
From owner-freebsd-scsi@freebsd.org Mon Jan 16 11:04:48 2017
From: Eugene Grosbein <eugen@grosbein.net>
Date: Mon, 16 Jan 2017 18:04:28 +0700
Subject: Re: Understanding the rationale behind dropping of "block devices"
To: Julian Elischer, Aijaz Baig, Greg 'groggy' Lehey
Cc: FreeBSD Hackers, freebsd-scsi@freebsd.org

16.01.2017 16:20, Julian Elischer wrote:

> If you want device M, at offset N we will fetch it for you from the
> device, DMA'd directly into your address space, but there is no cached
> copy.  Having said that, it would be trivial to add a 'caching' geom
> layer to the system but that has never been needed.

In fact, FreeBSD has had geom_cache/gcache(8) for a long time.  It is a
block-level read cache that passes write requests through transparently.
It is unmaintained, though, and there have been reports that it is
suspected of causing kernel panics when more than one active GEOM_CACHE
instance exists in a system.
From owner-freebsd-scsi@freebsd.org Mon Jan 16 11:15:43 2017
From: Konstantin Belousov <kostikbel@gmail.com>
Date: Mon, 16 Jan 2017 13:15:36 +0200
Subject: Re: Understanding the rationale behind dropping of "block devices"
To: Aijaz Baig
Cc: Jan Bramkamp, freebsd-scsi@freebsd.org

On Mon, Jan 16, 2017 at 04:19:59PM +0530, Aijaz Baig wrote:
> I must add that I am getting confused specifically between two
> different things here:
> From the replies above it appears that all disk accesses have to go
> through the VM subsystem now (so no raw disk accesses) however the
> arch handbook says raw interfaces are the way to go for disks
> (https://www.freebsd.org/doc/en/books/arch-handbook/driverbasics-block.html)?

Do not mix the concept of raw disk access and using some VM code to
implement this access.  See my other reply for some more explanation of
raw disk access; physio, in the terminology of the kernel source files,
sys/kern/kern_physio.c.

> Secondly, I presume that the VM subsystem has its own caching and
> buffering mechanism that is independent of the file system, so an IO
> can choose to skip buffering at the file-system layer but will still
> be served by the VM cache, irrespective of whatever the VM object maps
> to. Is that true? I believe this is what is meant by 'caching' at the
> VM layer.

First, the term page cache has a different meaning in the kernel code,
and that page cache was removed from the kernel very recently.
The more correct but much longer term is 'page queue of the vm object'.
If a given vnode has a vm object associated with it, then the buffer
cache ensures that buffers for a given chunk of the vnode data range are
created from appropriately indexed pages from the queue.  This way, the
buffer cache becomes consistent with the page queue.  The vm object is
created on the first vnode open by filesystem-specific code, at least
for UFS/devfs/msdosfs/cd9660 etc.

Caching policy for buffers is determined both by the buffer cache and by
(somewhat strong) hints from the filesystems interfacing with the cache.
The pages constituting the buffer are wired, i.e. the VM subsystem is
informed by the buffer cache not to reclaim pages while the buffer is
alive.

VM page caching, i.e. storing pages in the vnode page queue, is only
independent from the buffer cache when VM needs to, and can, handle
something that does not involve the buffer cache.  E.g. on a page fault
in a region backed by a file, VM allocates the necessary fresh (AKA
without valid content) pages and issues a read request into the
filesystem which owns the vnode.  It is up to the filesystem to
implement the read in any reasonable way.

Until recently, UFS and other local filesystems provided raw disk block
indexes to the generic VM code, which then read content from the disk
blocks into the pages.  This had its own share of problems (but not the
consistency issue, since pages are allocated in the vnode vm object page
queue).  I changed that path to go through the buffer cache explicitly
several months ago.

But all this is highly dependent on the filesystem.  As the polar case,
tmpfs reuses the swap-backed object, which holds the file data, as the
vnode's vm object.  The result is that paging requests for a tmpfs
mapped file are handled as if it were swap-backed anonymous memory.

ZFS cannot reuse the vm object page queue for its very special cache,
ARC.  So it keeps the consistency between writes and mmaps by copying
the data on write(2) both into the ARC buffer and into the pages from
the vm object.

Hope this helps.
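The write(2)/mmap coherence described above (the property ZFS has to
emulate by double-copying) can be seen from userland with a minimal
sketch, assuming a scratch file on a unified-buffer-cache filesystem
such as UFS:

    /*
     * Sketch: a MAP_SHARED mapping and read(2)/write(2) on the same
     * file stay coherent because both go through the vnode's vm object
     * page queue.  /tmp/coherence.dat is an assumed scratch path.
     */
    #include <sys/mman.h>
    #include <err.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
        int fd = open("/tmp/coherence.dat",
            O_RDWR | O_CREAT | O_TRUNC, 0600);
        if (fd == -1)
            err(1, "open");
        if (ftruncate(fd, 4096) == -1)
            err(1, "ftruncate");

        char *map = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
            MAP_SHARED, fd, 0);
        if (map == MAP_FAILED)
            err(1, "mmap");

        map[0] = 'Z';                   /* store through the mapping... */

        char c;
        if (pread(fd, &c, 1, 0) != 1)   /* ...read back through the fd */
            err(1, "pread");
        printf("read(2) sees '%c' written via mmap\n", c);

        munmap(map, 4096);
        close(fd);
        return 0;
    }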
From owner-freebsd-scsi@freebsd.org Tue Jan 17 11:45:41 2017
From: Aijaz Baig <aijazbaig1@gmail.com>
Date: Tue, 17 Jan 2017 17:15:38 +0530
Subject: Re: Understanding the rationale behind dropping of "block devices"
To: Konstantin Belousov
Cc: Julian Elischer, Greg 'groggy' Lehey, FreeBSD Hackers, freebsd-scsi@freebsd.org

First of all, a very big thank you to Konstantin for such a detailed
reply! It has been a pleasure to go through. My replies are inline.

On Mon, Jan 16, 2017 at 4:30 PM, Konstantin Belousov wrote:

> This is not true.
>
> We do have a buffer cache of the blocks read through the device
> (special) vnode.  This is how, typically, the metadata for filesystems
> which are clients of the buffer cache is handled, i.e. UFS, msdosfs,
> cd9660 etc.
> It is up to the filesystem to not create aliased cached copies of the
> blocks, both in the device vnode buffer list and in the filesystem
> vnode.
>
> In fact, sometimes filesystems, e.g. UFS, consciously break this rule
> and read blocks of the user vnode through the disk cache.  For
> instance, this happens for the SU truncation of the indirect blocks.

This makes a lot more sense now. It basically means that no matter the
underlying entity behind the vnode, be it a device special file or a
remote location, access always goes through the VFS layer. As you
clearly said, the filesystem has sole discretion over what to do with
those blocks; as an example, UFS breaks this rule by caching disk blocks
that have already been cached via the device vnode.

> The useful interpretation of the claim that FreeBSD does not cache
> disk blocks is that the cache is not accessible over the
> user-initiated i/o (read(2) and write(2)) through the opened devfs
> nodes.  If a program issues such a request, it indeed goes directly
> to/from the disk driver, which is supplied a kernel buffer formed by
> remapped user pages.

So read(2) and write(2) calls on device nodes bypass the VFS buffer
cache as well, and the IO goes directly through kernel memory pages
(which, as you said, are in fact remapped user-land pages). Does that
mean only the filesystem code now uses the disk buffer cache?

> Note that if this device was or is mounted and the filesystem kept
> some metadata in the buffer cache, then the devfs i/o would make the
> cache inconsistent.

Device being mounted as a filesystem, you mean? Could you please
elaborate?

> We do carry two interfaces in the cdev drivers, which are lumped into
> one.  In particular, it is not easy to implement mapping of the block
> devices exactly because the interfaces are mixed.

By "mapping" the block devices, you mean serving the IO intended for the
said disk blocks, right? So, as you said, we can either serve the IO via
the VFS directly using the buffer cache, or we can do that via the
filesystem cache.

> If a cdev disk device is mapped, VM would try to use cdevsw d_mmap or
> later mapping interfaces to handle user page faults, which is
> incorrect for the purpose of disk block mapping.

Could you please elaborate?

> Do not mix the concept of raw disk access and using some VM code to
> implement this access.
> Note that if this device was or is mounted and the filesystem kept some
> metadata in the buffer cache, then the devfs i/o would make the
> cache inconsistent.
>
You mean the device being mounted as a file system? Could you please
elaborate?

> > The added complexity of carrying around two alternate interfaces to
> > the same devices was judged by those who did the work to be not worth
> > the small gain available to the very few people who used raw devices.
> > Interestingly, since that time ZFS has implemented a block-layer cache
> > for itself which is of course not integrated with the non-existing
> > block level cache in the system :-).
> We do carry two interfaces in the cdev drivers, which are lumped into
> one. In particular, it is not easy to implement mapping of the block
> devices exactly because the interfaces are mixed.
By "mapping" of the block devices, you mean serving the IO intended for
the said disk blocks, right? So, as you said, we can either serve the IO
via the VFS directly using the buffer cache, or we could do that via the
file system cache.

> If a cdev disk device is mapped, VM would try to use cdevsw's d_mmap
> or later mapping interfaces to handle user page faults, which is incorrect
> for the purpose of the disk block mapping.
Could you please elaborate?

> > I must add that I am getting confused specifically between two different
> > things here:
> > From the replies above it appears that all disk accesses have to go
> > through the VM subsystem now (so no raw disk accesses) however the arch
> > handbook says raw interfaces are the way to go for disks
> > (https://www.freebsd.org/doc/en/books/arch-handbook/driverbasics-block.html)?
> Do not mix the concept of raw disk access and using some VM code to
> implement this access. See my other reply for some more explanation of
> the raw disk access, physio in the kernel source files terminology,
> sys/kern/kern_physio.c.
>
Yes, I have taken note of your earlier replies (thank you for being so
elaborate) and, as I have reiterated earlier, I now understand that raw
disk access is direct between kernel memory and the underlying device.
So, as you mentioned, the file system code (and perhaps only a few other
entities) uses the disk buffer cache that the VM implements. So an end
user cannot interact with the buffer cache in any way; is that what it is?
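For reference, the physio(9) path mentioned above is small; a hypothetical
character device's read routine going through it could be sketched as below
(the driver name is made up, and disk cdevs get an equivalent entry point
from the GEOM dev code rather than writing their own):

[CODE]
/*
 * Sketch: a cdev read routine handing the request to physio(9)
 * (sys/kern/kern_physio.c).  physio() wires and remaps the user
 * pages behind 'uio' and calls the device's d_strategy on them,
 * so data is transferred to/from the caller's memory with no
 * cached copy left behind.  'mydev' is hypothetical.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/conf.h>
#include <sys/uio.h>

static int
mydev_read(struct cdev *dev, struct uio *uio, int ioflag)
{
	return (physio(dev, uio, ioflag));
}
[/CODE]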
> > Secondly, I presume that the VM subsystem has its own caching and
> > buffering mechanism that is independent of the file system, so an IO
> > can choose to skip buffering at the file-system layer; however it will
> > still be served by the VM cache irrespective of whatever the VM object
> > maps to. Is that true? I believe this is what is meant by 'caching' at
> > the VM layer.
> First, the term page cache has a different meaning in the kernel code,
> and that page cache was removed from the kernel very recently.
> A more correct but much longer term is 'page queue of the vm object'. If
> a given vnode has a vm object associated with it, then the buffer cache
> ensures that buffers for the given chunk of the vnode data range are
> created from appropriately indexed pages from the queue. This way, the
> buffer cache becomes consistent with the page queue.
> The vm object is created on the first vnode open by filesystem-specific
> code, at least for UFS/devfs/msdosfs/cd9660 etc.
I understand page cache as a cache implemented by the file system to speed
up IO access (at least this is what Linux defines it as). Does it have a
different meaning in FreeBSD?

So a vnode is a VFS entity, right? And I presume a VM object is any object
from the perspective of the virtual memory subsystem. Since we no longer
run in real mode, isn't every vnode actually supposed to have an entity in
the VM subsystem? Maybe I am not understanding what 'page cache' means in
FreeBSD? I mean, every vnode in the VFS layer must have a backing VM
object, right? Maybe only mmap(2)'ed device nodes don't have a backing VM
object, or do they? If this assumption is correct then I cannot get my
mind around what you mentioned regarding buffer caches coming into play
for vnodes *only* if they have a backing vm object.

> Caching policy for buffers is determined both by the buffer cache and by
> (somewhat strong) hints from the filesystems interfacing with the cache.
> The pages constituting the buffer are wired, i.e. the VM subsystem is
> informed by the buffer cache to not reclaim pages while the buffer is
> alive.
>
> VM page caching, i.e. storing them in the vnode page queue, is only
> independent from the buffer cache when VM need/can handle something
> that does not involve the buffer cache. E.g. on a page fault in a
> region backed by a file, VM allocates the necessary fresh (AKA without
> valid content) pages and issues a read request into the filesystem which
> owns the vnode. It is up to the filesystem to implement read in any
> reasonable way.
>
> Until recently, UFS and other local filesystems provided raw disk block
> indexes for the generic VM code, which then read content from the disk
> blocks into the pages. This has its own share of problems (but not
> the consistency issue, since pages are allocated in the vnode vm
> object page queue). I changed that path to go through the buffer cache
> explicitly several months ago.
>
> But all this is highly dependent on the filesystem. As the polar case,
> tmpfs reuses the swap-backed object, which holds the file data, as the
> vnode's vm object. The result is that paging requests for the tmpfs
> mapped file are handled as if it were swap-backed anonymous memory.
>
> ZFS cannot reuse the vm object page queue for its very special cache,
> ARC. So it keeps the consistency between writes and mmaps by copying the
> data on write(2) both into the ARC buffer and into the pages from the
> vm object.
Well, this is rather confusing (sorry again); maybe too much detail for a
noob like myself to appreciate at this stage of my journey.

Nevertheless, to summarize this: raw disk block access bypasses the
buffer cache (as you had so painstakingly explained about read(2) and
write(2) above) but is still cached by the VFS in the page queue. However,
this is also at the sole discretion of the file-system, right?

To summarize, the page cache (or rather the page queue for a given vnode)
and the buffer cache are in fact separate entities, although they are very
tightly coupled for the most part, except in cases like what you mentioned
(about file-backed data).

So if we think of these as vertical layers, how would they look? From what
you say about page faults, it appears that the VM subsystem is placed
above, or perhaps adjacent to, the VFS layer. Is that correct? Also, about
these caches: the buffer cache is a global cache available to both the VM
subsystem as well as the VFS layer, whereas the page queue for the vnode
is the responsibility of the underlying file-system. Is that true?

> Hope this helps.
Of course this has helped. Although it has raised a lot more questions, as
you can see, at least it has got me thinking in (hopefully) the right
direction. Once again, a very big thank you!!

Best Regards,
Aijaz Baig

From owner-freebsd-scsi@freebsd.org Tue Jan 17 17:36:38 2017
From: bugzilla-noreply@freebsd.org
To: freebsd-scsi@FreeBSD.org
Subject: [Bug 204614] LOR In mpr(4)
Date: Tue, 17 Jan 2017 17:36:37 +0000
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204614

pete@nomadlogic.org changed:

           What    |Removed          |Added
----------------------------------------------------------------------
         Resolution|---              |Unable to Reproduce
             Status|New              |Closed

--- Comment #3 from pete@nomadlogic.org ---
No longer have access to this system so closing to clean up queue.

-- 
You are receiving this mail because:
You are on the CC list for the bug.

From owner-freebsd-scsi@freebsd.org Wed Jan 18 11:46:32 2017
From: Konstantin Belousov <kostikbel@gmail.com>
Date: Wed, 18 Jan 2017 13:46:23 +0200
Subject: Re: Understanding the rationale behind dropping of "block devices"
To: Aijaz Baig
Cc: Julian Elischer, "Greg 'groggy' Lehey", FreeBSD Hackers, freebsd-scsi@freebsd.org
Message-ID: <20170118114623.GF2349@kib.kiev.ua>
References: <20170116071105.GB4560@eureka.lemis.com> <20170116110009.GN2349@kib.kiev.ua>

On Tue, Jan 17, 2017 at 05:15:38PM +0530, Aijaz Baig wrote:
> First of all, a very big thank you to Konstantin for such a detailed
> reply! It has been a pleasure to go through. My replies are inline.
>
> On Mon, Jan 16, 2017 at 4:30 PM, Konstantin Belousov wrote:
>
> > This is not true.
> >
> > We do have a buffer cache of the blocks read through the device (special)
> > vnode. This is how, typically, the metadata for filesystems which are
> > clients of the buffer cache is handled, i.e. UFS, msdosfs, cd9660, etc.
> > It is up to the filesystem to not create aliased cached copies of the
> > blocks both in the device vnode buffer list and in the filesystem vnode.
> >
> > In fact, sometimes filesystems, e.g. UFS, consciously break this rule
> > and read blocks of the user vnode through the disk cache. For instance,
> > this happens for the SU truncation of the indirect blocks.
> >
> This makes a lot more sense now. This basically means that no matter what
> the underlying entity for the vnode is, be it a device special file or a
> remote location, it always has to go through the VFS layer. So, as you
> clearly said, the file-system has the sole discretion what to do with
> them. And as an example, UFS does break this rule by caching disk blocks
> which have already been cached by the VFS.

I do not understand what the 'it' is that has to go through the VFS layer.

> > > If you want device M, at offset N we will fetch it for you from the
> > > device, DMA'd directly into your address space,
> > > but there is no cached copy.
> > > Having said that, it would be trivial to add a 'caching' geom layer to
> > > the system but that has never been needed.
> > The useful interpretation of the claim that FreeBSD does not cache
> > disk blocks is that the cache is not accessible over the user-initiated
> > i/o (read(2) and write(2)) through the opened devfs nodes. If a program
> > issues such a request, it indeed goes directly to/from the disk driver,
> > which is supplied a kernel buffer formed by remapped user pages.
> So basically read(2) and write(2) calls on device nodes bypass the VFS
> buffer cache as well, and the IO indeed goes directly through kernel
> memory pages (which, as you said, are in fact remapped user-land pages).
> So does that mean only the file-system code now uses the disk buffer
> cache?

Right now, in the tree, only filesystems call into vfs_bio.c.

> > Note that if this device was or is mounted and the filesystem kept some
> > metadata in the buffer cache, then the devfs i/o would make the
> > cache inconsistent.
> >
> You mean the device being mounted as a file system? Could you please
> elaborate?

Yes, a device which carries a volume, and the volume is mounted.

> > > The added complexity of carrying around two alternate interfaces to
> > > the same devices was judged by those who did the work to be not worth
> > > the small gain available to the very few people who used raw devices.
> > > Interestingly, since that time ZFS has implemented a block-layer cache
> > > for itself which is of course not integrated with the non-existing
> > > block level cache in the system :-).
> > We do carry two interfaces in the cdev drivers, which are lumped into
> > one. In particular, it is not easy to implement mapping of the block
> > devices exactly because the interfaces are mixed.
> By "mapping" of the block devices, you mean serving the IO intended for
> the said disk blocks, right?

I mean, using the mmap(2) interface on the file which references the
device special node.

> So, as you said, we can either serve the IO via the VFS directly using
> the buffer cache, or we could do that via the file system cache.

> > If a cdev disk device is mapped, VM would try to use cdevsw's d_mmap
> > or later mapping interfaces to handle user page faults, which is
> > incorrect for the purpose of the disk block mapping.
> Could you please elaborate?

Read the code, I do not see much sense in rewording things that are stated
in the code.
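For readers following along, the interface in question is the cdevsw
d_mmap hook: on each user page fault the driver names the physical page
backing a given offset. A hypothetical sketch (device name and addresses
made up) shows why this fits a frame-buffer-like device but not pageable
disk blocks:

[CODE]
/*
 * Sketch: the cdevsw d_mmap interface.  The VM system asks the
 * driver which physical page backs the faulting offset.  That model
 * suits devices with fixed physical memory (frame buffers, register
 * windows); disk blocks live in pageable buffer-cache pages and
 * have no such stable physical address.  MYDEV_* are hypothetical.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/conf.h>
#include <sys/errno.h>
#include <vm/vm.h>

#define	MYDEV_PHYS_BASE	0xd0000000UL	/* hypothetical device memory */
#define	MYDEV_PHYS_SIZE	(4UL * 1024 * 1024)

static int
mydev_mmap(struct cdev *dev, vm_ooffset_t offset, vm_paddr_t *paddr,
    int nprot, vm_memattr_t *memattr)
{
	if (offset >= MYDEV_PHYS_SIZE)
		return (EINVAL);
	*paddr = MYDEV_PHYS_BASE + offset;	/* one fixed page per fault */
	return (0);
}
[/CODE]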
> > > I must add that I am getting confused specifically between two
> > > different things here:
> > > From the replies above it appears that all disk accesses have to go
> > > through the VM subsystem now (so no raw disk accesses) however the
> > > arch handbook says raw interfaces are the way to go for disks
> > > (https://www.freebsd.org/doc/en/books/arch-handbook/driverbasics-block.html)?
> > Do not mix the concept of raw disk access and using some VM code to
> > implement this access. See my other reply for some more explanation of
> > the raw disk access, physio in the kernel source files terminology,
> > sys/kern/kern_physio.c.
> >
> Yes, I have taken note of your earlier replies (thank you for being so
> elaborate) and, as I have reiterated earlier, I now understand that raw
> disk access is direct between kernel memory and the underlying device.

Such io is always direct between memory and device. The difference is in
who owns the memory used for io, and how this is interpreted by the system.

> So, as you mentioned, the file system code (and perhaps only a few other
> entities) uses the disk buffer cache that the VM implements. So an end
> user cannot interact with the buffer cache in any way; is that what it
> is?

This question does not make any sense. The buffer cache is a kernel
subsystem, used as a library for other parts of the kernel.

> > > Secondly, I presume that the VM subsystem has its own caching and
> > > buffering mechanism that is independent of the file system, so an IO
> > > can choose to skip buffering at the file-system layer; however it
> > > will still be served by the VM cache irrespective of whatever the VM
> > > object maps to. Is that true? I believe this is what is meant by
> > > 'caching' at the VM layer.
> > First, the term page cache has a different meaning in the kernel code,
> > and that page cache was removed from the kernel very recently.
> > A more correct but much longer term is 'page queue of the vm object'.
> > If a given vnode has a vm object associated with it, then the buffer
> > cache ensures that buffers for the given chunk of the vnode data range
> > are created from appropriately indexed pages from the queue. This way,
> > the buffer cache becomes consistent with the page queue.
> > The vm object is created on the first vnode open by filesystem-specific
> > code, at least for UFS/devfs/msdosfs/cd9660 etc.
> I understand page cache as a cache implemented by the file system to
> speed up IO access (at least this is what Linux defines it as). Does it
> have a different meaning in FreeBSD?

I explicitly answered this question in advance, above.

> So a vnode is a VFS entity, right? And I presume a VM object is any
> object from the perspective of the virtual memory subsystem.

No, a vm object is a struct vm_object.

> Since we no longer run in real mode, isn't every vnode actually supposed
> to have an entity in the VM subsystem? Maybe I am not understanding what
> 'page cache' means in FreeBSD?

At this point, I am not able to add any more information. Unless you read
the code, any further explanations would not provide any useful sense.

> I mean, every vnode in the VFS layer must have a backing VM object,
> right?

No.

> Maybe only mmap(2)'ed device nodes don't have a backing VM object, or do
> they?

Device vnodes do have a backing VM object, but they cannot be mapped.
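The "cannot be mapped" part is easy to observe from userland; a minimal
sketch, with /dev/ada0 again an assumption:

[CODE]
/*
 * Sketch: mmap(2) on a FreeBSD disk device node is expected to fail,
 * unlike on systems that route block-device pages through a page
 * cache.  /dev/ada0 is an assumption.
 */
#include <sys/mman.h>
#include <err.h>
#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
	void *p;
	int fd;

	fd = open("/dev/ada0", O_RDONLY);
	if (fd < 0)
		err(1, "open");
	p = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		warn("mmap");	/* the expected outcome on a disk cdev */
	else
		munmap(p, 4096);
	close(fd);
	return (0);
}
[/CODE]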
> If this assumption is correct then I cannot get my mind around what you
> mentioned regarding buffer caches coming into play for vnodes *only* if
> they have a backing vm object.

I never said this.

> > Caching policy for buffers is determined both by the buffer cache and
> > by (somewhat strong) hints from the filesystems interfacing with the
> > cache. The pages constituting the buffer are wired, i.e. the VM
> > subsystem is informed by the buffer cache to not reclaim pages while
> > the buffer is alive.
> >
> > VM page caching, i.e. storing them in the vnode page queue, is only
> > independent from the buffer cache when VM need/can handle something
> > that does not involve the buffer cache. E.g. on a page fault in a
> > region backed by a file, VM allocates the necessary fresh (AKA without
> > valid content) pages and issues a read request into the filesystem
> > which owns the vnode. It is up to the filesystem to implement read in
> > any reasonable way.
> >
> > Until recently, UFS and other local filesystems provided raw disk block
> > indexes for the generic VM code, which then read content from the disk
> > blocks into the pages. This has its own share of problems (but not
> > the consistency issue, since pages are allocated in the vnode vm
> > object page queue). I changed that path to go through the buffer cache
> > explicitly several months ago.
> >
> > But all this is highly dependent on the filesystem. As the polar case,
> > tmpfs reuses the swap-backed object, which holds the file data, as the
> > vnode's vm object. The result is that paging requests for the tmpfs
> > mapped file are handled as if it were swap-backed anonymous memory.
> >
> > ZFS cannot reuse the vm object page queue for its very special cache,
> > ARC. So it keeps the consistency between writes and mmaps by copying
> > the data on write(2) both into the ARC buffer and into the pages from
> > the vm object.
> Well, this is rather confusing (sorry again); maybe too much detail for
> a noob like myself to appreciate at this stage of my journey.
>
> Nevertheless, to summarize this: raw disk block access bypasses the
> buffer cache (as you had so painstakingly explained about read(2) and
> write(2) above) but is still cached by the VFS in the page queue.
> However, this is also at the sole discretion of the file-system, right?
>
> To summarize, the page cache (or rather the page queue for a given
> vnode) and the buffer cache are in fact separate entities, although they
> are very tightly coupled for the most part, except in cases like what
> you mentioned (about file-backed data).
>
> So if we think of these as vertical layers, how would they look? From
> what you say about page faults, it appears that the VM subsystem is
> placed above, or perhaps adjacent to, the VFS layer. Is that correct?
> Also, about these caches: the buffer cache is a global cache available
> to both the VM subsystem as well as the VFS layer, whereas the page
> queue for the vnode is the responsibility of the underlying file-system.
> Is that true?
>
> > Hope this helps.
> Of course this has helped. Although it has raised a lot more questions,
> as you can see, at least it has got me thinking in (hopefully) the right
> direction. Once again, a very big thank you!!
>
> Best Regards,
> Aijaz Baig

From owner-freebsd-scsi@freebsd.org Sat Jan 21 03:04:20 2017
From: Aijaz Baig <aijazbaig1@gmail.com>
Date: Sat, 21 Jan 2017 08:34:15 +0530
Subject: Re: Understanding the rationale behind dropping of "block devices"
To: Konstantin Belousov
Cc: Julian Elischer, Greg 'groggy' Lehey, FreeBSD Hackers, freebsd-scsi@freebsd.org
Message-ID: <20170121030415.5111889.13690.2248@gmail.com>