Date:      Wed, 28 Feb 2024 20:22:08 -0800
From:      Chris Torek <chris.torek@gmail.com>
To:        George Mitchell <george+freebsd@m5p.com>
Cc:        FreeBSD Hackers <freebsd-hackers@freebsd.org>
Subject:   Re: ISO-8859-1 file name in UTF-8 file system
Message-ID:  <CAPx1GveyVSHBXJ7vG8oCY83auVrBOT33UoHrtO=cVRkuOreM=w@mail.gmail.com>
In-Reply-To: <8260e116-45af-4047-8138-3d0bb7b0ee2a@m5p.com>
References:  <8260e116-45af-4047-8138-3d0bb7b0ee2a@m5p.com>

On Wed, Feb 28, 2024 at 5:31 PM George Mitchell <george+freebsd@m5p.com> wrote:
> (I tried sending this to freebsd-python, but I can't post there
> because I haven't subscribed, and I'm hoping someone here will have
> a suggestion.  Thanks for your indulgence.)
>
> In Python 3.9 on FreeBSD 13.2-RELEASE, sys.getfilesystemencoding()
> reports 'utf-8'.  However, a couple of ancient files on one of my
> disks have names that were evidently ISO-8859-1 encoded at the time
> they were originally created.  When I os.walk() through a directory
> with one of these files, the UTF-8 string name of the file has, for
> example, a '\udcc3' in it.  Literally, the file name on disk had
> hex c3 at that position (ISO-8859-1 for Ã), and I guess \udcc3 is a
> surrogate for the 0xc3, which is incomprehensible in conformant
> UTF-8 (though I don't understand "surrogates" in UTF-8 and you can't
> take that last statement as gospel).
>
> Be that as it may, what can I do at this point to transmogrify that
> Python str with the \udcc3 back into the literal bytes found in the
> file name on the disk, so that I can then encode them into proper
> UTF-8 from ISO-8859-1?                                    -- George

I ran into this problem ages ago on another system.  Here is what I did
(note that some modern Python checkers hate the lambda form; I wrote
this a long time ago):

import sys

if sys.version_info[0] >= 3:
    # Python3 encodes "impossible" strings using UTF-8 and
    # surrogate escapes.  For instance, a file named <\300><\300>eek
    # (where \300 is octal 300, 0xc0 hex) turns into '\udcc0\udcc0eek'.
    # This is how we can losslessly re-encode this as a byte string:
    path_to_bytes = lambda path: path.encode('utf8', 'surrogateescape')

    # If we wish to print one of these byte strings, we have a
    # problem, because they're not valid UTF-8.  This method
    # treats the encoded bytes as pass-through, which is
    # probably the best we can do.
    bpath_to_str = lambda path: path.decode('unicode_escape')
else:
    # Python2 just uses byte strings, so OS paths are already
    # byte strings and we return them unmodified.
    path_to_bytes = lambda path: path
    bpath_to_str = lambda path: path
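
For the specific case asked about (getting from the '\udcc3' str back
to the on-disk bytes, then to proper UTF-8), here is a minimal Python 3
sketch.  The helper names are made up for illustration, and it assumes
the whole file name was written as ISO-8859-1; a name that mixes valid
UTF-8 with stray Latin-1 bytes would need more careful handling.

```python
def surrogate_name_to_bytes(name: str) -> bytes:
    # Each lone surrogate \udcXX is mapped back to the single byte 0xXX;
    # ordinary characters are encoded as normal UTF-8.
    return name.encode("utf-8", "surrogateescape")


def reinterpret_as_latin1(name: str) -> str:
    # Assumption: every byte of the on-disk name is ISO-8859-1.
    return surrogate_name_to_bytes(name).decode("iso-8859-1")


raw = surrogate_name_to_bytes("\udcc3eek")   # the literal bytes on disk
fixed = reinterpret_as_latin1("\udcc3eek")   # a clean str, no surrogates
utf8_name = fixed.encode("utf-8")            # what a UTF-8 name would hold
print(raw, ascii(fixed), utf8_name)
```

The resulting bytes could then be handed to os.rename(), which accepts
bytes paths, if the goal is to fix the name on disk.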

Chris


