Date: Fri, 28 May 2010 23:45:58 +0300
From: Nikos Vassiliadis <nvass9573@gmx.com>
To: Polytropon <freebsd@edvax.de>
Cc: Gary Kline <kline@thought.org>, FreeBSD Mailing List <freebsd-questions@freebsd.org>
Subject: Re: any shortcuts to doc to ascii?
Message-ID: <4C002B86.5090007@gmx.com>
In-Reply-To: <20100528090057.87144ef4.freebsd@edvax.de>
References: <20100527013843.GA40751@thought.org> <20100527050302.da39c258.freebsd@edvax.de> <20100527233607.GD19297@thought.org> <20100528090057.87144ef4.freebsd@edvax.de>
Polytropon wrote:
> On Thu, 27 May 2010 16:36:08 -0700, Gary Kline <kline@thought.org> wrote:
>> i don't see any ascii suffix [for OOo]. i saved as .txt.
>
> This should be right. The .txt extension refers to ASCII text,
> at least in standard-compliant operating systems.
>
>> same krap. the \x94, x9d, \x9c... same with catdoc. i'll
>> try antiword. [forgot about that. ]
>
> This makes me believe that the original DOC file has been created
> with a wrong character set or language setting. "Windows" - as far
> as I know - does not use standard locales such as all other systems
> do, but uses an arbitrary setting.

It is valid UTF-8 encoded text:

[nik@moby ~]$ python -c 'print "Don%c%c%ct" % (0xe2, 0x80, 0x99)' | file -
/dev/stdin: UTF-8 Unicode text

You'll be able to see the character if you fire up a UTF-8 capable
terminal with proper locale settings:

[nik@moby ~]$ LC_ALL=en_US.UTF-8 xterm -u8

After that, just print the char:

python -c 'print "Don%c%c%ct" % (0xe2, 0x80, 0x99)'

and use copy & paste to pass it to tr to translate it to something
else, for example:

tr '’' "'" < $file > $output

> Another idea may be that the character that you think should be
> an apostrophe isn't an apostrophe. I often do see this in German
> texts with misplaced apostrophes that are in fact accent grave
> or accent acute, or a character from UTF-8 that just looks like
> an apostrophe. For example, if the original document contains
>
> We don`t
>
> and this ` is not a real ', then conversion tools will of course
> use the "escape notation" for this unknown character.

Indeed, the standard tool for encoding translations, iconv, chokes
on this. Yet, it worked when I tried to convert from UTF-8 to a Greek
encoding ('iconv -f utf-8 -t iso-8859-7').

Some info on the char:
http://www.fileformat.info/info/unicode/char/2019/index.htm

HTH, Nikos
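
For anyone who would rather not copy & paste the character into a
terminal, the same replacement can be sketched in Python 3. This is
only an illustration, not what was run in the message above: it
assumes the input file really is UTF-8, and the names
asciify_apostrophes and convert_file are made up for this example.

```python
# The three bytes from the message, 0xe2 0x80 0x99, are the UTF-8
# encoding of U+2019 (RIGHT SINGLE QUOTATION MARK).
RIGHT_SINGLE_QUOTE = b"\xe2\x80\x99".decode("utf-8")


def asciify_apostrophes(text):
    """Replace every U+2019 in text with a plain ASCII apostrophe."""
    return text.replace(RIGHT_SINGLE_QUOTE, "'")


def convert_file(src_path, dst_path):
    """Hypothetical helper: read UTF-8 from src_path, write ASCII to dst_path.

    Assumption: the input is valid UTF-8. Any other non-ASCII character
    is written as a backslash escape instead of aborting the conversion.
    """
    with open(src_path, encoding="utf-8") as src:
        text = src.read()
    with open(dst_path, "w", encoding="ascii", errors="backslashreplace") as dst:
        dst.write(asciify_apostrophes(text))


if __name__ == "__main__":
    sample = b"Don\xe2\x80\x99t".decode("utf-8")
    print(asciify_apostrophes(sample))
```

Only this one character is mapped here; other typographic quotes or
dashes from the DOC file would need their own replacements.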