Date: Sat, 14 Oct 2006 14:38:08 +0200
From: Erik Norgaard <norgaard@locolomo.org>
To: Beech Rintoul <freebsd@alaskaparadise.com>
Cc: freebsd-questions@freebsd.org
Subject: Re: Non English Spam
Message-ID: <4530DA30.7060004@locolomo.org>
In-Reply-To: <200610131712.46822.freebsd@alaskaparadise.com>
References: <200610131712.46822.freebsd@alaskaparadise.com>

Beech Rintoul wrote:
> I'm getting a ton of spam every day that comes from China, Japan and Korea.
> Spam Assassin completely ignores it because it has all non-english characters
> and slows kmail to a crawl loading. Is there a way to filter on non-english
> either using Spam Assassin or procmail?

I get none after adding simple filter rules for postfix:

# Accepted mime headers: (ASCII, UTF-8 and ISO-8859-X)
/^Content-Type:.*?charset\s*=\s*"?(us-ascii|iso-8859-\d+|utf-8)"?/
    OK HDR2000 Accepted charset: $1

Strictly, you can reject every other character set, but I chose to make it
explicit:

# Reject specific character sets
# Chinese, Japanese and Korean
/^Content-Type:.*?charset\s*=\s*"?(Big5|gb2312|euc-cn)"?/
    REJECT HDR2100: Unaccepted character set: "$1"
/^Content-Type:.*?charset\s*=\s*"?(euc-kr|iso-2022-kr)"?/
    REJECT HDR2110: Unaccepted character set: "$1"
/^Content-Type:.*?charset\s*=\s*"?(iso-2022-\w+|euc-jp|shift_jis)"?/
    REJECT HDR2120: Unaccepted character set: "$1"

# Cyrillic character sets: Russian/Ukrainian
/^Content-Type:.*?charset\s*=\s*"?(koi8-(?:r|u))"?/
    REJECT HDR2200: Unaccepted character set: "$1"
/^Content-Type:.*?charset\s*=\s*"?(windows-(?:1250|1251))"?/
    REJECT HDR2210: Unaccepted character set: "$1"

And then you may want a catch-all rule to catch unknown character sets:

/^Content-Type:.*?charset\s*=\s*"?([\w-]+)"?/
    WARN HDR2299: Unknown character set: "$1"

You may change WARN to REJECT.

I have noticed, however, that some subscribers to this list write English
encoded in one of the above character sets. I don't know enough about the
character set definitions, but it seems that English characters are a subset
of any character set? What is the recommended policy here? Should subscribers
be advised to change character set when posting to the list?

Cheers, Erik

-- 
Ph: +34.666334818                      web: http://www.locolomo.org
X.509 Certificate: http://www.locolomo.org/crt/8D03551FFCE04F0C.crt
Key ID: 69:79:B8:2C:E3:8F:E7:BE:5D:C3:C3:B1:74:62:B8:3F:9F:1F:69:B9
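
Before loading rules like the ones in this message into a live server, it can help to try them against a few sample headers. The following is only a rough sketch in Python, not part of the original setup: the `classify` helper and the `RULES` list are hypothetical names, the patterns are copied from the message, and the catch-all is written as `[\w-]+` so that it captures the whole charset name. It assumes first-match-wins ordering and case-insensitive matching, which is how Postfix pcre tables behave by default. Python's `re` is close to PCRE for these patterns but is not identical in general.

```python
import re

# Patterns copied from the message, in table order. The first match wins.
# The last entry is the catch-all, written here with [\w-]+ (an assumption)
# so that the complete charset name ends up in the capture group.
PREFIX = r'^Content-Type:.*?charset\s*=\s*"?'
RULES = [
    (PREFIX + r'(us-ascii|iso-8859-\d+|utf-8)"?', "OK"),
    (PREFIX + r'(Big5|gb2312|euc-cn)"?', "REJECT"),
    (PREFIX + r'(euc-kr|iso-2022-kr)"?', "REJECT"),
    (PREFIX + r'(iso-2022-\w+|euc-jp|shift_jis)"?', "REJECT"),
    (PREFIX + r'(koi8-(?:r|u))"?', "REJECT"),
    (PREFIX + r'(windows-(?:1250|1251))"?', "REJECT"),
    (PREFIX + r'([\w-]+)"?', "WARN"),
]

# Compile once, case-insensitively, to mimic the Postfix table default.
COMPILED = [(re.compile(pattern, re.IGNORECASE), action)
            for pattern, action in RULES]


def classify(header):
    """Return (action, charset) for one unfolded header line.

    Returns ("DUNNO", None) when no rule matches, for example when the
    header carries no charset parameter at all.
    """
    for pattern, action in COMPILED:
        match = pattern.search(header)
        if match:
            return action, match.group(1)
    return "DUNNO", None


if __name__ == "__main__":
    samples = [
        'Content-Type: text/plain; charset="UTF-8"',
        "Content-Type: text/html; charset=gb2312",
        "Content-Type: text/plain; charset=iso-2022-jp",
        "Content-Type: text/plain; charset=windows-1252",
        "Content-Type: text/plain",
    ]
    for sample in samples:
        print(classify(sample), sample)
```

Running the script prints the action each sample header would trigger, which makes it easy to spot a charset that falls through to the catch-all when it should have been accepted or rejected explicitly.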