Date: Fri, 28 May 2010 23:45:58 +0300
From: Nikos Vassiliadis <nvass9573@gmx.com>
To: Polytropon <freebsd@edvax.de>
Cc: Gary Kline <kline@thought.org>, FreeBSD Mailing List <freebsd-questions@freebsd.org>
Subject: Re: any shortcuts to doc to ascii?
Message-ID: <4C002B86.5090007@gmx.com>
In-Reply-To: <20100528090057.87144ef4.freebsd@edvax.de>
References: <20100527013843.GA40751@thought.org> <20100527050302.da39c258.freebsd@edvax.de> <20100527233607.GD19297@thought.org> <20100528090057.87144ef4.freebsd@edvax.de>
Polytropon wrote:
> On Thu, 27 May 2010 16:36:08 -0700, Gary Kline <kline@thought.org> wrote:
>> i don't see any ascii suffix [for OOo]. i saved as .txt.
>
> This should be right. The .txt extension refers to ASCII text,
> at least in standard-compliant operating systems.
>
>> same krap. the \x94, x9d, \x9c... same with catdoc. i'll
>> try antiword. [forgot about that. ]
>
> This makes me believe that the original DOC file has been created
> with a wrong character set or language setting. "Windows" - as far
> as I know - does not use standard locales such as all other systems
> do, but uses an arbitrary setting.
>
It is valid UTF-8 encoded text:
[nik@moby ~]$ python -c 'print "Don%c%c%ct" % (0xe2, 0x80, 0x99)' | file -
/dev/stdin: UTF-8 Unicode text
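The same check can be done without file(1). What follows is only a rough sketch in Python 3 (the session above uses the Python 2 print statement), and the variable names are made up for illustration. It decodes the three bytes e2 80 99 from the command above and looks up which character they form:

```python
import unicodedata

# "Don" + the bytes e2 80 99 + "t", the same bytes as in the command above.
raw = bytes([0x44, 0x6F, 0x6E, 0xE2, 0x80, 0x99, 0x74])

# decode() raises UnicodeDecodeError if the bytes are not valid UTF-8.
text = raw.decode("utf-8")

char = text[3]                      # the character after "Don"
codepoint = "U+%04X" % ord(char)    # code point in the usual U+XXXX form
name = unicodedata.name(char)       # official Unicode character name

print(codepoint, name)
```

If the decode succeeds, the byte sequence is valid UTF-8, and the name tells you what the "strange" character actually is.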
You'll be able to see the character if you fire up a UTF-8 capable
terminal with proper locale settings.
[nik@moby ~]$ LC_ALL=en_US.UTF-8 xterm -u8
After that, just print the char:
python -c 'print "Don%c%c%ct" % (0xe2, 0x80, 0x99)'
and use copy & paste to pass it to tr to translate it to something else,
for example:
tr '’' "'" < "$file" > "$output"
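If copy & paste into tr is awkward, the same translation can be sketched in Python 3. This is only an illustration: the table and the function name asciify_punctuation are hypothetical, and the extra entries assume that the \x9c, \x9d and \x94 bytes reported earlier are the tails of the UTF-8 sequences for curly double quotes and the em dash, which I have not verified against the actual file.

```python
# Hypothetical mapping from typographic punctuation to plain ASCII.
# Only U+2019 is confirmed by this thread; the rest are assumptions.
PUNCT_TO_ASCII = str.maketrans({
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark (the "apostrophe")
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2014": "--",   # em dash
})


def asciify_punctuation(text):
    """Return text with the characters in the table replaced by ASCII."""
    return text.translate(PUNCT_TO_ASCII)


print(asciify_punctuation("Don\u2019t"))
```

To use it on a file, open the file with encoding="utf-8", pass the contents through the function, and write the result back out.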
> Another idea may be that the character that you think should be
> an apostrophe isn't an apostrophe. I often do see this in German
> texts with misplaced apostrophes that are in fact accent grave
> or accent acute, or a character from UTF-8 that just looks like
> an apostrophe. For example, if the original document contains
>
> We don`t
>
> and this ` is not a real ', then conversion tools will of course
> use the "escape notation" for this unknown character.
Indeed, the standard tool for encoding translations, iconv, chokes on
this. Yet, it worked when I tried to convert from UTF-8 to a Greek
encoding ('iconv -f utf-8 -t iso-8859-7'). Some info on the char:
http://www.fileformat.info/info/unicode/char/2019/index.htm
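A likely explanation, which is my assumption and not something I checked against iconv itself, is that the conversion fails whenever the target encoding has no slot for U+2019, while ISO-8859-7 happens to have one. A small Python 3 sketch using Python's own codecs (the helper name can_encode is made up) shows the difference:

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


sample = "Don\u2019t"  # "Don" + RIGHT SINGLE QUOTATION MARK + "t"

# Try a few target encodings; these use Python's codecs, not iconv.
for enc in ("ascii", "iso-8859-1", "iso-8859-7"):
    print(enc, can_encode(sample, enc))
```

Here ASCII and ISO-8859-1 cannot represent the character, while ISO-8859-7 can, which matches what I saw with the Greek encoding.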
HTH, Nikos
