From: Matthew Seaman
Date: Mon, 5 Jan 2004 15:28:41 +0000
To: zhangweiwu@realss.com
Cc: questions@freebsd.org
Subject: Re: help me with this sed expression
Message-ID: <20040105152841.GA2784@happy-idiot-talk.infracaninophile.co.uk>

On Mon, Jan 05, 2004 at 07:49:43PM +0800, Zhang Weiwu wrote:
> Hello.
> I've spent an hour trying to work out a series of sed commands to process
> some text, without any luck (you know I'm kind of a newbie). I would really
> appreciate your help.
>
> The original text file is in this form -- for each line:
> one Chinese word, then one or two English words separated by spaces.
>
> I wish to change it to:
> 1) target file: one English word, then a space, then the Chinese word
>    corresponding to that English word.
> 2) if in the original file one Chinese word has more than one English word
>    following it on the same line, repeat the Chinese word to satisfy 1).
>
> Define: Chinese word = one or more contiguous bytes of data where each byte
> is greater than 128 in value. (This is true in the GB2312 Chinese charset,
> which this email is written in.)
> Define: English word = one or more contiguous bytes of [a-z].
>
> Say, for the original file:
> ===========
> ??a av
> ????????aaav
> ????????aacm
> ===========
> The target file should be:
> ===========
> a ??
> av ??
> aaav ????????
> aacm ????????
> ===========
>
> I tried things like s/\(.*\)\([a-z]*\)/\2 \1/ but the first \(.*\) is
> too greedy and swallows the trailing [a-z] part as well.

Dunno about sed(1) but you could do the job like this:

    perl -lne '($c, $e) = m/^([\x{81}-\x{ff}]+)([a-z ]+)\z/; foreach $x (split " ", $e) { print "$x $c"; }' filename

	Cheers,

	Matthew

-- 
Dr Matthew J Seaman MA, D.Phil.                       26 The Paddocks
                                                      Savill Way
PGP: http://www.infracaninophile.co.uk/pgpkey         Marlow
Tel: +44 1628 476614                                  Bucks., SL7 1TH UK
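
For anyone who finds the one-liner hard to read, the same idea can be sketched in Python. This is only an illustration, not part of the original suggestion: the function name `swap_line` is made up here, and it assumes, as the Perl pattern does, that a Chinese word is a run of bytes in the range 0x81-0xff followed by lowercase English words separated by spaces. Lines that do not fit that shape are silently skipped.

```python
import re
import sys

# Assumption: same byte ranges as the Perl pattern above.
# Group 1 = run of high bytes (the Chinese word),
# group 2 = lowercase English words and the spaces between them.
LINE = re.compile(rb"^([\x81-\xff]+)([a-z ]+)$")


def swap_line(line):
    """Return one b"english chinese" entry per English word on the line."""
    m = LINE.match(line.rstrip(b"\r\n"))
    if m is None:
        return []  # line does not fit the expected shape; skip it
    chinese, english = m.groups()
    # bytes.split() with no argument drops leading and repeated spaces
    return [word + b" " + chinese for word in english.split()]


if __name__ == "__main__":
    # Work on raw bytes so no particular locale or charset is required.
    for raw in sys.stdin.buffer:
        for entry in swap_line(raw):
            sys.stdout.buffer.write(entry + b"\n")
```

Saved under a hypothetical name such as swap.py, it would be run as `python3 swap.py < filename`. Reading and writing bytes rather than text keeps the GB2312 data untouched whatever the terminal's locale is.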