From: Chris Torek
Date: Wed, 28 Feb 2024 20:22:08 -0800
Subject: Re: ISO-8859-1 file name in UTF-8 file system
To: George Mitchell
Cc: FreeBSD Hackers
List-Id: Technical discussions relating to FreeBSD
List-Archive: https://lists.freebsd.org/archives/freebsd-hackers
In-Reply-To: <8260e116-45af-4047-8138-3d0bb7b0ee2a@m5p.com>

On Wed, Feb 28, 2024 at 5:31 PM George Mitchell wrote:
> (I tried sending this to freebsd-python, but I can't post there
> because I haven't subscribed, and I'm hoping someone here will have
> a suggestion. Thanks for your indulgence.)
>
> In Python 3.9 on FreeBSD 13.2-RELEASE, sys.getfilesystemencoding()
> reports 'utf-8'. However, a couple of ancient files on one of my
> disks have names that were evidently ISO-8859-1 encoded at the time
> they were originally created. When I os.walk() through a directory
> with one of these files, the UTF-8 string name of the file has, for
> example, a '\udcc3' in it. Literally, the file name on disk had
> hex c3 at that position (ISO-8859-1 for Ã), and I guess \udcc3 is a
> surrogate for the 0xc3, which is incomprehensible in conformant
> UTF-8 (though I don't understand "surrogates" in UTF-8 and you can't
> take that last statement as gospel).
>
> Be that as it may, what can I do at this point to transmogrify that
> Python str with the \udcc3 back into the literal bytes found in the
> file name on the disk, so that I can then encode them into proper
> UTF-8 from ISO-8859-1?          -- George

I ran into this problem ages ago on another system.
Here is what I did (note that some modern Python checkers hate the
lambda form; I wrote this a long time ago):

    import sys

    if sys.version_info[0] >= 3:
        # Python3 encodes "impossible" strings using UTF-8 and
        # surrogate escapes.  For instance, a file named <\300><\300>eek
        # (where \300 is octal 300, 0xc0 hex) turns into '\udcc0\udcc0eek'.
        # This is how we can losslessly re-encode this as a byte string:
        path_to_bytes = lambda path: path.encode('utf8', 'surrogateescape')

        # If we wish to print one of these byte strings, we have a
        # problem, because they're not valid UTF-8.  This method
        # treats the encoded bytes as pass-through, which is
        # probably the best we can do.
        bpath_to_str = lambda path: path.decode('unicode_escape')
    else:
        # Python2 just uses byte strings, so OS paths are already
        # byte strings and we return them unmodified.
        path_to_bytes = lambda path: path
        bpath_to_str = lambda path: path

Chris
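
Applied to the question in the quoted message, the same 'surrogateescape'
round trip might look like the sketch below. It assumes the entire name was
originally ISO-8859-1 (a name mixing valid UTF-8 with stray bytes would need
more care), and the helper name fix_latin1_name is hypothetical, not
something from either message.

```python
def fix_latin1_name(name):
    """Hypothetical helper: repair a str name that came back from os.walk()
    with surrogate escapes, assuming the bytes on disk were ISO-8859-1."""
    try:
        # A name with no lone surrogates encodes cleanly; leave it alone.
        name.encode('utf-8')
        return name
    except UnicodeEncodeError:
        pass
    # Recover the literal bytes stored in the directory entry ...
    raw = name.encode('utf-8', 'surrogateescape')
    # ... and reinterpret them as ISO-8859-1 (assumption stated above).
    return raw.decode('iso-8859-1')


# 'A\udcc3' stands for the on-disk bytes b'A\xc3'.
fixed = fix_latin1_name('A\udcc3')
utf8_bytes = fixed.encode('utf-8')   # proper UTF-8 for the repaired name
```

The returned str encodes to valid UTF-8, so it could serve as the new name
in an os.rename() call if the goal is to fix the file name on disk.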