Spam Filtering

From Apis Networks Wiki

Overview

SpamAssassin is an open-source software application that intelligently scans messages to determine spam/non-spam status of e-mail.

Usage

SpamAssassin is implicitly invoked through the global maildrop filter (/etc/maildroprc) for each site. No further steps are necessary to enable SpamAssassin for a site.

Global Preferences

Default preferences may be set for all users by editing /etc/mail/spamassassin/global_prefs. A user may override global directives through the file .spamassassin/user_prefs in the user's home directory. Global preferences accept the same directives as the user preferences defined in Mail::SpamAssassin::Conf(3); this includes, but is not limited to, custom scoring and rules.

Custom Scoring

Scores for default rules may be overridden globally in /etc/mail/spamassassin/global_prefs and per-user in .spamassassin/user_prefs within the user's home directory. Syntax follows the form:

score SYMBOLIC_TEST_NAME n.nn

SYMBOLIC_TEST_NAME is the symbolic name SpamAssassin uses for the test. n.nn is the number the rule contributes to the message score when it matches. Examples include -10, 10, 5.5, and 3.25.

  • A negative value subtracts from the computed message score, while a positive value adds to it. A message whose total score is 5 or greater is flagged as spam.
  • Setting a rule's score to 0 disables that rule.
  • A score surrounded by parentheses '()' is relative to the score already set, i.e., '(3)' increases the score for this rule by 3 points.
# Flag messages receiving a 99% probability as definitive spam
score BAYES_99 10
# Times are tough and I have bad credit - help me get out of debt, Mother Russia!
score BAD_CREDIT -2
# A custom rule defined globally in global_prefs
score A_GLOBAL_TEST 5
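
The scoring arithmetic described above can be modeled in a few lines of Python. This is only an illustrative sketch, not SpamAssassin's implementation: the function names apply_score and is_spam are made up for this example, and the starting scores (3.5 and 1.0) are hypothetical values, since the real defaults depend on the installed ruleset.

```python
def apply_score(current: float, directive_value: str) -> float:
    """Return a rule's new score after a `score` directive value is applied."""
    value = directive_value.strip()
    if value.startswith("(") and value.endswith(")"):
        # Parenthesized values are relative to the score already set.
        return current + float(value[1:-1])
    # Plain values replace the existing score outright.
    return float(value)


def is_spam(rule_scores, required_score=5.0):
    """A message is flagged once the sum of matched rule scores reaches the threshold."""
    return sum(rule_scores) >= required_score


# Hypothetical starting scores, chosen only for illustration.
bayes_99 = apply_score(3.5, "(3)")    # relative: 3.5 + 3
bad_credit = apply_score(1.0, "-2")   # absolute: replaces the old score
print(bayes_99, bad_credit)
print(is_spam([bayes_99, bad_credit]))
```

In this model a message matching both rules stays below the threshold of 5, because the negative BAD_CREDIT score offsets part of the BAYES_99 score.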
 
See Also
  • SpamAssassin Wiki: Tests Performed - A full listing of tests, symbolic names, and detailed explanations

Custom Rules

Additional rules consist of a test directive (body or header) and a score directive. describe is an optional directive that provides an explanation of the symbolic name if a message is flagged as spam. Two examples are listed below, one that checks the body of a message and another that inspects the transmission headers (From/Subject/To...):

# This rule checks the body of a message for "test"
# Messages found containing the string "test" in body receive an additional 0.1 points
body LOCAL_DEMONSTRATION_RULE   /test/
score LOCAL_DEMONSTRATION_RULE 0.1
describe LOCAL_DEMONSTRATION_RULE This is a simple test rule
 
# This rule checks the transmission headers for the "From" field
# If the From field contains "bookmyspam.net", then 10 points are added
header LOCAL_DEMONSTRATION_FROM From =~ /bookmyspam\.net/i
score LOCAL_DEMONSTRATION_FROM  10
describe LOCAL_DEMONSTRATION_FROM Worst spamvertise campaign ever

Configuring Options

Options apply to both local and global configurations. The following is a partial list of popular options for tuning SpamAssassin. A full listing is available in the Mail::SpamAssassin::Conf manual.

required_score n.nn
Set the threshold that a message must score before it is considered spam. (Default: 5)
rewrite_header subject phrase
Alter messages tagged as spam by prepending phrase to the subject line. (Default: [SPAM] (_SCORE_))
report_safe (0 | 1 | 2)
Attach the original message to a generated report when set to 1 (as a message/rfc822 attachment) or 2 (as a text/plain attachment). When set to 0, the original message is left intact and only headers are added. (Default: 0)
bayes_min_ham_num n
Minimum number of non-spam messages learned before Bayes is activated. Bayesian rules greatly improve accuracy, but require data to train. (Default: 200)
bayes_min_spam_num n
Same as above, but the minimum number of spam messages learned. (Default: 200)

Problems

Is SpamAssassin filtering my messages?

Barring rare, exceptional circumstances, yes. SpamAssassin is internally monitored by the integrity daemon to ensure it is up and running. Since adding a SpamAssassin check to the daemon, we have seen zero reported cases of SpamAssassin being offline and only a handful of false positives within a year. Spam that slips through is due to either (a) low Bayesian scoring or (b) a recently compromised computer sending out spam. To confirm that a message was scanned, view its transmission headers: in Thunderbird, select View -> Message Details; in Outlook 2007, expand the Options menu. Below is an example header from a correctly filtered message:

X-Spam-Checker-Version: SpamAssassin 3.2.4 (2008-01-01) on assmule.apisnetworks.com

If the message was labeled as spam, X-Spam-Flag reports YES; otherwise it reports NO.

X-Spam-Score: 7.1 
X-Spam-Flag: YES

Further, if the message is labeled as spam, the subject will have [SPAM] at the front followed by its score. You may use this attribute to filter within your e-mail client, or filter server-side with the SpamAssassin Wizard.
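
For anyone who would rather check these headers in a script than in a mail client, here is a minimal sketch using Python's standard email module. The sample message, its subject line, and the function name spam_status are hypothetical and exist only for illustration; the header names are the ones shown above.

```python
from email import message_from_string

# Hypothetical sample message built from the headers shown above.
RAW_MESSAGE = """\
X-Spam-Checker-Version: SpamAssassin 3.2.4 (2008-01-01) on assmule.apisnetworks.com
X-Spam-Score: 7.1
X-Spam-Flag: YES
Subject: [SPAM] (7.1) Hypothetical subject line

Message body.
"""


def spam_status(raw: str):
    """Return (flagged, score) from the X-Spam-* headers; score is None if absent."""
    msg = message_from_string(raw)
    # A missing X-Spam-Flag header is treated as "not flagged".
    flagged = msg.get("X-Spam-Flag", "NO").strip().upper() == "YES"
    score_header = msg.get("X-Spam-Score")
    score = float(score_header) if score_header is not None else None
    return flagged, score


print(spam_status(RAW_MESSAGE))
```

A message with no X-Spam-* headers at all returns (False, None), which may indicate that it was never scanned.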

Scoring

How can I enhance scoring?

If your account has received an adequate volume of e-mail (200 spam + 200 non-spam messages), Bayesian filtering will automatically activate. As your account ages, Bayesian filtering will progressively become more accurate. Edit .spamassassin/user_prefs to increase the scores for Bayesian probabilities of 95% and above:

score BAYES_99 7
score BAYES_95 5

This should greatly enhance the server's ability to catch new spam, but only if you have an adequate number of learned messages.

It is also recommended you run sa-learn periodically on missed spam to retrain the filter.

See also: Streamlining SpamAssassin's Learning Process

Determining how many spams/hams are learned

sa-learn --dump magic will display the Bayesian metadata.

[msaladna@assmule ~]$ sa-learn --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0     140684          0  non-token data: nspam
0.000          0      33568          0  non-token data: nham
0.000          0     136013          0  non-token data: ntokens
0.000          0 1207054271          0  non-token data: oldest atime
0.000          0 1207402325          0  non-token data: newest atime
0.000          0 1207402432          0  non-token data: last journal sync atime
0.000          0 1207399817          0  non-token data: last expiry atime
0.000          0     345600          0  non-token data: last expire atime delta
0.000          0      15349          0  non-token data: last expire reduction count

nham: number of hams learned
nspam: number of spams learned
ntokens: number of tokens (words) within the database

In this example the database has learned 140,684 spams and 33,568 hams, and the oldest entry is from April 1st, 2008 (Unix timestamp 1207054271).
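
The same figures can be read out of the dump programmatically. The sketch below is one possible approach, assuming the column layout shown in the output above (the value of interest is the third column); the function name parse_magic is hypothetical, and the sample text is copied from the first lines of that output.

```python
from datetime import datetime, timezone

# First lines of the `sa-learn --dump magic` output shown above.
DUMP = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0     140684          0  non-token data: nspam
0.000          0      33568          0  non-token data: nham
0.000          0     136013          0  non-token data: ntokens
0.000          0 1207054271          0  non-token data: oldest atime
"""


def parse_magic(dump: str) -> dict:
    """Map each 'non-token data' label to the integer in the third column."""
    stats = {}
    for line in dump.splitlines():
        fields, sep, label = line.partition("non-token data:")
        if not sep:
            continue  # skip anything that is not a metadata line
        stats[label.strip()] = int(fields.split()[2])
    return stats


stats = parse_magic(DUMP)
# 200 is the default listed for bayes_min_spam_num and bayes_min_ham_num.
bayes_active = stats["nspam"] >= 200 and stats["nham"] >= 200
# atime values are Unix timestamps; interpret them in UTC.
oldest = datetime.fromtimestamp(stats["oldest atime"], tz=timezone.utc)
print(stats["nspam"], stats["nham"], bayes_active, oldest.date())
```

With the sample data, both counts are well above 200, so Bayesian filtering would be active for this account under the default thresholds.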
