Date:      Sat, 14 Oct 2006 14:38:08 +0200
From:      Erik Norgaard <norgaard@locolomo.org>
To:        Beech Rintoul <freebsd@alaskaparadise.com>
Cc:        freebsd-questions@freebsd.org
Subject:   Re: Non English Spam
Message-ID:  <4530DA30.7060004@locolomo.org>
In-Reply-To: <200610131712.46822.freebsd@alaskaparadise.com>
References:  <200610131712.46822.freebsd@alaskaparadise.com>

Beech Rintoul wrote:
> I'm getting a ton of spam every day  that comes from China, Japan and Korea. 
> Spam Assassin completely ignores it because it has all non-english characters 
> and slows kmail to a crawl loading. Is there a way to filter on non-english 
> either using Spam Assassin or procmail? 

I get none after adding simple filter rules for postfix:

# Accepted mime headers: (ASCII, UTF-8 and ISO-8859-X)
/^Content-Type:.*?charset\s*=\s*"?(us-ascii|iso-8859-\d+|utf-8)"?/
     OK     HDR2000 Accepted charset: $1

Strictly speaking, you could reject every other character set, but I chose 
to make it explicit:

# Reject specific character sets
# Chinese, Japanese and Korean
/^Content-Type:.*?charset\s*=\s*"?(Big5|gb2312|euc-cn)"?/
     REJECT HDR2100: Unaccepted character set: "$1"
/^Content-Type:.*?charset\s*=\s*"?(euc-kr|iso-2022-kr)"?/
     REJECT HDR2110: Unaccepted character set: "$1"
/^Content-Type:.*?charset\s*=\s*"?(iso-2022-\w+|euc-jp|shift_jis)"?/
     REJECT HDR2120: Unaccepted character set: "$1"
# Cyrillic character sets: Russian/Ukrainian
/^Content-Type:.*?charset\s*=\s*"?(koi8-(?:r|u))"?/
     REJECT HDR2200: Unaccepted character set: "$1"
/^Content-Type:.*?charset\s*=\s*"?(windows-(?:1250|1251))"?/
     REJECT HDR2210: Unaccepted character set: "$1"
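
Before loading rules like these into Postfix, it can help to try the 
patterns against a few sample headers. The following is only a rough 
sketch in Python, not part of the Postfix setup: the classify() helper and 
the sample headers are made up for illustration, and Python's re module is 
assumed to behave like PCRE for these particular patterns. Postfix PCRE 
tables match case-insensitively by default, which is mimicked here with 
re.IGNORECASE.

```python
import re

# Rules transcribed from the message, in order; the first match wins.
# The shared prefix and the trailing optional quote are factored out.
PREFIX = r'^Content-Type:.*?charset\s*=\s*"?'
RULES = [
    ("OK",     r'(us-ascii|iso-8859-\d+|utf-8)'),
    ("REJECT", r'(Big5|gb2312|euc-cn)'),
    ("REJECT", r'(euc-kr|iso-2022-kr)'),
    ("REJECT", r'(iso-2022-\w+|euc-jp|shift_jis)'),
    ("REJECT", r'(koi8-(?:r|u))'),
    ("REJECT", r'(windows-(?:1250|1251))'),
]
COMPILED = [(action, re.compile(PREFIX + body + r'"?', re.IGNORECASE))
            for action, body in RULES]


def classify(header):
    """Return (action, charset) for the first matching rule, else (None, None)."""
    for action, pattern in COMPILED:
        match = pattern.search(header)
        if match:
            return action, match.group(1)
    return None, None


# Hypothetical sample headers, for illustration only.
for line in ('Content-Type: text/plain; charset="UTF-8"',
             'Content-Type: text/html; charset=gb2312',
             'Content-Type: text/plain; charset=windows-1252'):
    print(line, "->", classify(line))
```

A header such as charset=windows-1252 falls through all of these rules, 
which is the case a final catch-all rule is meant to cover.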

And then you may want a catch-all rule to flag unknown character sets.

/^Content-Type:.*?charset\s*=\s*"?([\w-]+)"?/
     WARN   HDR2299: Unknown character set: "$1"

You may change WARN to REJECT.

I have noticed, however, that some subscribers to this list write English 
encoded in one of the above character sets. I don't know enough about 
the character set definitions, but it seems that English characters are a 
subset of any character set?
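
One way to probe that question is to encode a plain English sentence with 
several of the character sets named above and compare the bytes with the 
US-ASCII encoding. This is only a sketch: the codec names are Python's 
spellings, the sample sentence and the ascii_compatible() helper are made 
up for illustration, and the ISO-2022 variants are left out because they 
switch character sets with escape sequences.

```python
# Character sets from the reject rules, by their Python codec names.
CODECS = ["big5", "gb2312", "euc_kr", "euc_jp", "shift_jis",
          "koi8_r", "koi8_u", "cp1250", "cp1251"]


def ascii_compatible(text, codec):
    """True if `text` encodes to the same bytes in `codec` as in US-ASCII."""
    return text.encode(codec) == text.encode("ascii")


# Hypothetical sample: letters, spaces and basic punctuation only.
sample = "Plain English text, no accents."
results = {codec: ascii_compatible(sample, codec) for codec in CODECS}

for codec, same in results.items():
    print(f"{codec:10s} identical to ASCII: {same}")
```

If the bytes come out identical, a message labelled with one of these 
character sets may still contain nothing but English text. The rules above 
look only at the declared charset, so they cannot tell the two cases apart.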

What is the recommended policy here? Should subscribers be advised to 
change character set when posting to the list?

Cheers, Erik
-- 
Ph: +34.666334818                      web: http://www.locolomo.org
X.509 Certificate: http://www.locolomo.org/crt/8D03551FFCE04F0C.crt
Key ID: 69:79:B8:2C:E3:8F:E7:BE:5D:C3:C3:B1:74:62:B8:3F:9F:1F:69:B9


