From: Julian Elischer <julian@freebsd.org>
To: Konstantin Belousov
Cc: Aijaz Baig, "Greg 'groggy' Lehey", FreeBSD Hackers <freebsd-hackers@freebsd.org>, freebsd-scsi@freebsd.org
Subject: Re: Understanding the rationale behind dropping of "block devices"
Date: Fri, 27 Jan 2017 10:25:32 +0800
Message-ID: <7cf12959-5c1e-2be8-5974-69a96f2cd9d7@freebsd.org>
In-Reply-To: <20170116110009.GN2349@kib.kiev.ua>
List-Id: Technical Discussions relating to FreeBSD

On 16/1/17 7:00 pm, Konstantin Belousov wrote:
> On Mon, Jan 16, 2017 at 05:20:25PM +0800, Julian Elischer wrote:
>> On 16/01/2017 4:49 PM, Aijaz Baig wrote:
>>> Oh yes, I was actually running an old release inside a VM, and yes, I
>>> had changed the device names myself while jotting down notes (to give
>>> them more descriptive names, like what OS X does). So now I've checked
>>> it on a recent release, and yes, there is indeed no block device.
>>>
>>> root@bsd-client:/dev # gpart show
>>> =>        34  83886013  da0  GPT  (40G)
>>>           34      1024    1  freebsd-boot  (512K)
>>>         1058  58719232    2  freebsd-ufs  (28G)
>>>     58720290   3145728    3  freebsd-swap  (1.5G)
>>>     61866018  22020029       - free -  (10G)
>>>
>>> root@bsd-client:/dev # ls -lrt da*
>>> crw-r-----  1 root  operator  0x4d Dec 19 17:49 da0p1
>>> crw-r-----  1 root  operator  0x4b Dec 19 17:49 da0
>>> crw-r-----  1 root  operator  0x4f Dec 19 23:19 da0p3
>>> crw-r-----  1 root  operator  0x4e Dec 19 23:19 da0p2
>>>
>>> So this shows that I have a single SATA or SAS drive with apparently
>>> three partitions (or is it four? Why does it show unused space when I
>>> had used the entire disk?)
>>>
>>> Nevertheless my question still holds: what does "removing support for
>>> block devices" mean in this context? Was my earlier understanding
>>> correct, viz. that all disk devices now have a character (or raw)
>>> interface and are no longer served via the "page cache" but rather the
>>> "buffer cache"? Does that mean all disk accesses are now direct,
>>> bypassing the file system?
>> Basically, FreeBSD never really buffered/cached by device.
>>
>> Buffering and caching are done by vnode in the filesystem.
>> We have no device-based block cache. If you want file X at offset Y,
>> then we can satisfy that from cache.
>> VM objects map closely to vnode objects, so the VM system IS the file
>> system buffer cache.
> This is not true.
>
> We do have a buffer cache of the blocks read through the device
> (special) vnode. This is typically how the metadata for filesystems
> which are clients of the buffer cache (i.e. UFS, msdosfs, cd9660, etc.)
> is handled. It is up to the filesystem not to create aliased cached
> copies of the blocks both in the device vnode's buffer list and in the
> filesystem vnode.
>
> In fact, sometimes filesystems, e.g. UFS, consciously break this rule
> and read blocks of the user vnode through the disk cache. For instance,
> this happens for the SU truncation of the indirect blocks.

yes, this caches blocks as an offset into a device, but it is still
really a part of the system which provides caching services to vnodes.
(at least that is how it was the last time I looked)

>
>> If you want device M at offset N, we will fetch it for you from the
>> device, DMA'd directly into your address space,
>> but there is no cached copy.
>> Having said that, it would be trivial to add a 'caching' geom layer to
>> the system, but that has never been needed.
> The useful interpretation of the claim that FreeBSD does not cache
> disk blocks is that the cache is not accessible over user-initiated
> i/o (read(2) and write(2)) through the opened devfs nodes. If a program
> issues such a request, it indeed goes directly to/from the disk driver,
> which is supplied a kernel buffer formed by remapped user pages. Note
> that if this device was or is mounted and the filesystem kept some
> metadata in the buffer cache, then the devfs i/o would make the cache
> inconsistent.
>
>> The added complexity of carrying around two alternate interfaces to
>> the same devices was judged by those who did the work to be not worth
>> the small gain available to the very few people who used raw devices.
>> Interestingly, since that time ZFS has implemented a block-layer cache
>> for itself, which is of course not integrated with the non-existent
>> block-level cache in the system :-).
> We do carry two interfaces in the cdev drivers, which are lumped into
> one. In particular, it is not easy to implement mapping of the block
> devices exactly because the interfaces are mixed. If a cdev disk device
> is mapped, VM would try to use the cdevsw d_mmap or later mapping
> interfaces to handle user page faults, which is incorrect for the
> purpose of the disk block mapping.
>