From owner-freebsd-current@freebsd.org Wed Jul 20 20:23:56 2016
Message-Id: <201607202023.u6KKNksl055230@gw.catspoiler.org>
Date: Wed, 20 Jul 2016 13:23:46 -0700 (PDT)
From: Don Lewis
Subject: Re: UTF-8 by default?
To: darkuranium@gmail.com
cc: freebsd-current@freebsd.org
List-Id: Discussions about the use of FreeBSD-current

On 20 Jul, Tim Čas wrote:
> On 20 July 2016 at 20:33, Don Lewis wrote:
>> wc(1) has problems with its multibyte support pointed out by Coverity
>> as I recall.
>
> Not sure how critical that issue is (e.g. byte counts [`-c`], line
> counts [`-l`], and such should still work as intended; whether word
> counts work or not depends on whether we should count Unicode
> whitespace as, well, whitespace). I do wonder if everyone agrees that
> an effort should be made towards UTF-8 default, though?

wc(1) passes a fixed-length, non-NUL-terminated buffer (filled by read(2)) to mbrtowc(). Besides lacking termination, the buffer can begin or end with a partial character if the contents are UTF-8. The Coverity ID is 978825.
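
To illustrate the partial-character case, here is a minimal sketch of one way to walk a read(2)-style chunk with mbrtowc(). This is not the wc(1) source and not a proposed patch. The helper names count_chars() and use_utf8_locale() are hypothetical, and the locale names tried are assumptions about what the platform provides. The idea is to pass the remaining byte count as the length argument, so no terminator is needed, and to keep one mbstate_t across chunks, so a character split at a chunk boundary is finished by the next call.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper: select a UTF-8 LC_CTYPE.  The locale names are
 * assumptions; returns nonzero if one of them could be set.
 */
static int
use_utf8_locale(void)
{
	return (setlocale(LC_CTYPE, "C.UTF-8") != NULL ||
	    setlocale(LC_CTYPE, "en_US.UTF-8") != NULL);
}

/*
 * Hypothetical helper: count the characters completed within
 * buf[0..len).  The buffer need not be NUL terminated.  *mbs must be
 * zeroed before the first chunk and passed unchanged for later chunks.
 */
static size_t
count_chars(const char *buf, size_t len, mbstate_t *mbs)
{
	size_t count = 0;

	while (len > 0) {
		wchar_t wch;
		size_t clen = mbrtowc(&wch, buf, len, mbs);

		if (clen == (size_t)-2) {
			/*
			 * Incomplete character at the end of the chunk.
			 * All remaining bytes are now recorded in *mbs;
			 * the next chunk will complete the character.
			 */
			break;
		}
		if (clen == (size_t)-1) {
			/* Invalid sequence: reset state, skip one byte. */
			memset(mbs, 0, sizeof(*mbs));
			clen = 1;
		} else if (clen == 0) {
			/* Embedded NUL; assumed to occupy a single byte. */
			clen = 1;
		}
		assert(clen <= len);
		count++;
		buf += clen;
		len -= clen;
	}
	return (count);
}
```

A caller would zero an mbstate_t once, call use_utf8_locale() (or set the locale from the environment), and then call count_chars() on each chunk returned by read(2), summing the results. Counting an invalid byte as one character is a choice made only for this sketch.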