Date: Thu, 11 Apr 2002 13:39:55 -0700
From: Terry Lambert <tlambert2@mindspring.com>
To: Christian Weisgerber <naddy@mips.inka.de>
Cc: freebsd-chat@freebsd.org
Subject: Re: what are these characters please?
Message-ID: <3CB5F49B.C21B24E9@mindspring.com>
References: <3CB571D6.2C10B9AA@mindspring.com> <20020411113858.E48BB3F30@bast.unixathome.org> <a9492h$2g43$1@kemoauc.mips.inka.de>

Christian Weisgerber wrote:
> > I think my goal here is to remove all non-ISO-8859-1 characters from the
> > incoming cvs-all message.
>
> It makes more sense to clobber everything that isn't ASCII.
>
>     chomp($line);
>     $line =~ tr/\x09\x20-\x7E/?/c;    # tab, printable ASCII
>
> Putting a replacement character such as '?' or '#' there is probably
> less confusing than outright deleting the offending bytes.

In this case, it's probably ISO 2022 based EUC encoding for JIS-208,
so it's not going to be relevant anyway, since what has to be replaced
is a character set change sequence, a character, and a change back.

In this particular case, the advice about non-printable ASCII
characters doesn't work, either, since it will only swallow the
<ESC>, and not the rest of the sequence or the terminator.

Living with it -- or stripping the control characters -- is probably
the only thing that will work. The character set encoding information
was lost when the cut-and-paste happened (this is a good argument for
Unicode, *NOT* UTF-8, and 16 bit wchar_t).

In this case, stripping the escape sequence leaves a "d", and
stripping the non-printable ISO-8859-1 or ASCII leaves a ",Ad(B".

-- Terry
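
For comparison, here is a rough sketch of the three outcomes discussed above. The sample bytes (ESC , A d ESC ( B) are an assumption reconstructed from the ",Ad(B" mentioned in the message, not the actual commit text. The last pattern is likewise only an assumed, simplified match for an ISO 2022 designation sequence (ESC, one or more intermediate bytes 0x20-0x2F, one final byte 0x30-0x7E), not a full parser.

```perl
use strict;
use warnings;

# Assumed sample: ESC , A   d   ESC ( B   (reconstructed, hypothetical input)
my $raw = "\x1b,Ad\x1b(B";

# 1. Replace every byte that is not tab or printable ASCII with '?'.
(my $clobbered = $raw) =~ tr/\x09\x20-\x7E/?/c;

# 2. Delete those bytes outright instead of replacing them.
(my $deleted = $raw) =~ tr/\x09\x20-\x7E//cd;

# 3. Remove each whole designation sequence: ESC, intermediate byte(s), final byte.
(my $stripped = $raw) =~ s/\x1b[\x20-\x2f]+[\x30-\x7e]//g;

print "clobbered: $clobbered\n";
print "deleted:   $deleted\n";
print "stripped:  $stripped\n";
```

With that input, the first form prints "?,Ad?(B", the second ",Ad(B", and the third "d". Only the third removes the whole sequence, and even then the "d" that remains can no longer be interpreted in the character set it was written in.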