From: Jeff Roberson
Date: Sat, 12 Apr 2008 13:51:15 -1000 (HST)
To: arch@freebsd.org
Subject: f_offset

So I'm in the midst of working on other filesystem concurrency issues and that has brought me back around to f_offset again. I'm working on a method to allow non-overlapping writes and reads to proceed concurrently to the same file. This means the exclusive vnode lock cannot be used to protect f_offset even in the write case. To maintain the existing semantics I'm simply going to add an sx lock, taken exclusively with sx_xlock(), around access to f_offset.
This is done inconsistently today, which is fine in most cases because the unsynchronized updates are just user-space races. However, f_offset is 64 bits wide and cannot be written atomically on 32-bit systems, so it requires some extra synchronization there.

The sx lock will nearly double the size of struct file. Although struct file has lost some weight in 8.0, that is quite unfortunate. However, the method of using LOCKED & WAITING flags, msleep and a mutex has ruined performance in too many cases to continue using it.

It's worth discussing what POSIX actually guarantees for f_offset as well as what other operating systems do. POSIX does not guarantee any behavior with simultaneous access. Multiple readers may read the same position in the file concurrently and update the position to different offsets. Multiple writers may write to the same file location, although the I/O should be serialized by some other means. POSIX allows for, and Solaris, Linux, and historic implementations of f_offset work in, the following way:

	off = fp->f_offset;
	lock(vnode);
	vn_rdwr();
	unlock(vnode);
	fp->f_offset = uio->uio_offset;

What we implement is much stricter. It is essentially this:

	lock(offset);
	off = fp->f_offset;
	lock(vnode);
	vn_rdwr();
	unlock(vnode);
	fp->f_offset = uio->uio_offset;
	unlock(offset);

We provide the following extra guarantees:

1) Multiple readers will never see overlapping segments of the file
2) Multiple writers will never write to overlapping segments of the file

McKusick changed the behavior in 1986, I would guess for an rforked process. There is some test code in this fairly interesting lkml thread where they discuss the problem in Linux:

http://lkml.org/lkml/2006/4/12/227

Simply having multiple threads write to stdout in a file on Linux is enough to lose or corrupt output. I believe it is worth retaining the write guarantee. However, I believe the read guarantees are simply a side-effect of the original implementation of the write fix.
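
To make the stricter scheme above concrete, here is a minimal userspace sketch, not the kernel code. It assumes a pthread mutex standing in for the offset lock and pwrite() standing in for the vnode-locked vn_rdwr(). The names struct demo_file, strict_write() and run_strict_demo() are hypothetical and exist only for illustration.

```c
/*
 * Userspace illustration only. A pthread mutex plays the role of
 * lock(offset); struct demo_file is a hypothetical stand-in for
 * struct file.
 */
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define RECSZ 16

struct demo_file {
	int		fd;
	off_t		f_offset;	/* shared file position */
	pthread_mutex_t	off_lock;	/* stands in for lock(offset) */
};

struct worker_arg {
	struct demo_file *fp;
	int		  nrecords;
};

/* Strict scheme: the offset lock is held across the whole I/O. */
static ssize_t
strict_write(struct demo_file *fp, const void *buf, size_t len)
{
	ssize_t n;

	pthread_mutex_lock(&fp->off_lock);
	n = pwrite(fp->fd, buf, len, fp->f_offset);
	if (n > 0)
		fp->f_offset += n;	/* publish the new position */
	pthread_mutex_unlock(&fp->off_lock);
	return (n);
}

static void *
worker(void *arg)
{
	struct worker_arg *wa = arg;
	char rec[RECSZ];
	int i;

	memset(rec, 'x', sizeof(rec));
	for (i = 0; i < wa->nrecords; i++) {
		ssize_t n = strict_write(wa->fp, rec, sizeof(rec));
		assert(n == (ssize_t)sizeof(rec));
	}
	return (NULL);
}

/*
 * Run nthreads writers against one shared offset. Returns the final
 * offset, or -1 on setup failure or if the file size disagrees.
 */
static off_t
run_strict_demo(int nthreads, int nrecords)
{
	char path[] = "/tmp/f_offset_demo.XXXXXX";
	struct demo_file f;
	struct worker_arg wa;
	pthread_t *tds;
	off_t end;
	int i, started;

	if ((f.fd = mkstemp(path)) < 0)
		return (-1);
	unlink(path);			/* file disappears on close */
	f.f_offset = 0;
	pthread_mutex_init(&f.off_lock, NULL);
	wa.fp = &f;
	wa.nrecords = nrecords;

	if ((tds = calloc(nthreads, sizeof(*tds))) == NULL) {
		close(f.fd);
		return (-1);
	}
	for (started = 0; started < nthreads; started++)
		if (pthread_create(&tds[started], NULL, worker, &wa) != 0)
			break;
	for (i = 0; i < started; i++)
		pthread_join(tds[i], NULL);

	end = f.f_offset;
	if (started != nthreads || lseek(f.fd, 0, SEEK_END) != end)
		end = -1;		/* a thread failed or ranges overlapped */
	free(tds);
	pthread_mutex_destroy(&f.off_lock);
	close(f.fd);
	return (end);
}
```

With the mutex in place, run_strict_demo(4, 100) is expected to return 4 * 100 * RECSZ, because no two writers can claim the same range. Removing the mutex gives the relaxed scheme, where two threads can read the same f_offset and overwrite each other's records. On older systems this may need to be built with -pthread.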

I will probably commit a patch to add the sx with exclusive behavior to start. We need to at least protect 64-bit access on 32-bit machines in lseek(), which we don't today. Beyond that, I think we can relax the read restriction and allow f_offset readers to operate locklessly and only serialize writers. For this to work it would be nice if we had an MD way to write 64 bits atomically that simply acquired a lock on 32-bit platforms without something like cmpxchg8b, or on UP just did the write with interrupts disabled.

Comments?

Thanks,
Jeff