Spam Filtering
From Apis Networks Wiki
Overview
SpamAssassin is an open-source software application that intelligently scans messages to determine spam/non-spam status of e-mail.
Usage
SpamAssassin is implicitly invoked through the global maildrop filter (/etc/maildroprc) for each site. No further steps are necessary to enable SpamAssassin for a site.
Global Preferences
Default preferences may be set for all users by editing /etc/mail/spamassassin/global_prefs. A user may override global directives through the file .spamassassin/user_prefs in the user's home directory. Global preferences accept the same directives as user preferences, as defined in Mail::SpamAssassin::Conf(3); this includes, but is not limited to, custom scoring and rules.
Custom Scoring
Scores for default rules may be overridden globally in /etc/mail/spamassassin/global_prefs and per-user in .spamassassin/user_prefs within the user's home directory. Syntax follows the form:
- score SYMBOLIC_TEST_NAME n.nn
SYMBOLIC_TEST_NAME is the symbolic name used by SpamAssassin for the test. n.nn is the number of points the rule contributes to the message score when it matches. Examples include -10, 10, 5.5, and 3.25.
- Assigning a negative value subtracts from the message's computed score, while a positive value adds to it. A message whose total score is 5 or greater is flagged as spam.
- Setting a rule's score to 0 disables that rule.
- A score surrounded by parentheses '()' is relative to the score already set, i.e., '(3)' increases the score for this rule by 3 points.
# Flag messages receiving a 99% probability as definitive spam
score BAYES_99 10
# Times are tough and I have bad credit - help me get out of debt, Mother Russia!
score BAD_CREDIT -2
# A custom rule defined globally in global_prefs
score A_GLOBAL_TEST 5
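To illustrate how absolute and relative overrides combine into a message score, here is a minimal Python sketch. It is not SpamAssassin's implementation: the helper names (apply_override, is_spam) and the starting scores in defaults are hypothetical and chosen only for illustration.

```python
def apply_override(current, value):
    """Return the new rule score after a `score` directive.

    A value wrapped in parentheses, e.g. "(3)", is relative to the
    current score; any other value replaces the score outright.
    """
    value = value.strip()
    if value.startswith("(") and value.endswith(")"):
        return current + float(value[1:-1])
    return float(value)


def is_spam(hit_rules, scores, required_score=5.0):
    """Sum the scores of the rules that matched and compare to the threshold."""
    total = sum(scores.get(rule, 0.0) for rule in hit_rules)
    return total, total >= required_score


# Hypothetical starting scores, for illustration only.
defaults = {"BAYES_99": 3.5, "BAD_CREDIT": 1.0}

scores = dict(defaults)
scores["BAYES_99"] = apply_override(scores["BAYES_99"], "10")       # absolute
scores["BAD_CREDIT"] = apply_override(scores["BAD_CREDIT"], "(3)")  # relative

total, flagged = is_spam(["BAYES_99", "BAD_CREDIT"], scores)
print(total, flagged)
```

In this sketch the absolute override replaces the assumed starting score, while the relative override adds 3 points to it.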
- See Also
- SpamAssassin Wiki: Tests Performed - A full listing of tests, symbolic names, and detailed explanations
Custom Rules
Additional rules consist of a test directive (body or header) and a score directive. describe is an optional directive that provides an explanation of the symbolic name if a message is flagged as spam. Two examples are listed below: one that checks the body of a message and another that inspects the transmission headers (From/Subject/To...):
# This rule checks the body of a message for "test"
# Messages found containing the string "test" in body receive an additional 0.1 points
body LOCAL_DEMONSTRATION_RULE /test/
score LOCAL_DEMONSTRATION_RULE 0.1
describe LOCAL_DEMONSTRATION_RULE This is a simple test rule
# This rule checks the transmission headers for the "From" field
# If the From field contains "bookmyspam.net", then 10 points are added
header LOCAL_DEMONSTRATION_FROM From =~ /bookmyspam\.net/i
score LOCAL_DEMONSTRATION_FROM 10
describe LOCAL_DEMONSTRATION_FROM Worst spamvertise campaign ever
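The matching behind these two rules can be approximated outside SpamAssassin to test a pattern before deploying it. The Python sketch below is a simplification under stated assumptions: it uses Python's re module rather than Perl regular expressions, treats the raw payload as the body, and runs against a made-up sample message.

```python
import re
from email import message_from_string

# Made-up sample message, for illustration only.
raw = (
    "From: Offers <deals@BookMySpam.net>\n"
    "Subject: hello\n"
    "\n"
    "This is a test message.\n"
)
msg = message_from_string(raw)

# body LOCAL_DEMONSTRATION_RULE /test/
body_hit = re.search(r"test", msg.get_payload()) is not None

# header LOCAL_DEMONSTRATION_FROM From =~ /bookmyspam\.net/i
from_hit = re.search(r"bookmyspam\.net", msg.get("From", ""), re.IGNORECASE) is not None

# Add up the scores of the rules that matched.
score = 0.0
if body_hit:
    score += 0.1
if from_hit:
    score += 10

print(body_hit, from_hit, score)
```

Note that /test/ is a substring match, so it also matches words such as "latest" or "contest"; a pattern such as /\btest\b/ restricts the match to the whole word.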
- See Also
- SpamAssassin Wiki: Writing Rules
Configuring Options
Options apply to both local and global configurations. The following is a partial list of popular options for tuning SpamAssassin. A full listing is available in the Mail::SpamAssassin::Conf manual.
- required_score n.nn
- Set the score a message must reach before it is considered spam. (Default: 5)
- rewrite_header subject phrase
- Alter messages tagged as spam by prepending phrase to the subject line. (Default: [SPAM] (_SCORE_))
- report_safe (0 | 1 | 2)
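- Controls how a message tagged as spam is delivered. 0 delivers the original message with only its headers modified; 1 attaches the original message as a message/rfc822 attachment; 2 attaches it as text/plain. (Default: 0)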
- bayes_min_ham_num n
- Minimum number of non-spam messages that must be learned before Bayes is activated. Bayesian rules greatly improve accuracy, but require data to train. (Default: 200)
- bayes_min_spam_num n
- Same as above, but the minimum number of spam messages learned. (Default: 200)
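The interaction between required_score and rewrite_header can be sketched in a few lines of Python. This is an illustration only, not SpamAssassin code: tag_subject is a hypothetical helper, and formatting the score to one decimal place is an assumption.

```python
def tag_subject(subject, score, required_score=5.0, phrase="[SPAM] (_SCORE_)"):
    """Prepend the rewrite phrase when the score reaches the threshold."""
    if score < required_score:
        return subject
    # _SCORE_ is the placeholder from the default phrase listed above;
    # one-decimal formatting is an assumption made for this sketch.
    tag = phrase.replace("_SCORE_", f"{score:.1f}")
    return f"{tag} {subject}"


print(tag_subject("Cheap loans", 7.1))
print(tag_subject("Lunch on Friday?", 1.2))
print(tag_subject("Cheap loans", 7.1, required_score=8.0))
```

Raising required_score makes the filter more lenient: the same message that is tagged at the default threshold passes untouched at 8.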
Problems
Is SpamAssassin filtering my messages?
Barring rare, exceptional circumstances, yes. SpamAssassin is internally monitored by the integrity daemon to ensure it is up and running. Since adding a SpamAssassin check to the daemon, we have seen zero reported cases of SpamAssassin being offline and only a handful of false positives within a year. Spam that slips through is due to either (a) low Bayesian scoring or (b) a recently compromised computer sending out spam. To confirm that a message was scanned, view its transmission headers: in Thunderbird, select View -> Message Details; in Outlook 2007, expand the Options menu. Below is an example header from a correctly filtered message:
X-Spam-Checker-Version: SpamAssassin 3.2.4 (2008-01-01) on
    assmule.apisnetworks.com
If the message was labeled as spam, X-Spam-Flag reports YES; otherwise it reports NO:
X-Spam-Score: 7.1
X-Spam-Flag: YES
Further, if the message is labeled as spam, the subject is prefixed with [SPAM] followed by its score. You may use this attribute to filter within your e-mail client or filter server-side with the SpamAssassin Wizard.
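For checking a saved message programmatically, the headers above can be read with Python's standard email module. The sketch below is a minimal example: spam_status is a hypothetical helper, and the sample message is assembled from the headers shown above with a placeholder body.

```python
from email import message_from_string

# Headers taken from the example above; the body is a placeholder.
raw = (
    "X-Spam-Checker-Version: SpamAssassin 3.2.4 (2008-01-01) on\n"
    "    assmule.apisnetworks.com\n"
    "X-Spam-Score: 7.1\n"
    "X-Spam-Flag: YES\n"
    "\n"
    "Message body\n"
)


def spam_status(raw_message):
    """Return (scanned, flagged, score) for a raw RFC 822 message."""
    msg = message_from_string(raw_message)
    # The checker-version header shows that SpamAssassin processed the message.
    scanned = msg.get("X-Spam-Checker-Version") is not None
    flagged = (msg.get("X-Spam-Flag") or "").strip().upper() == "YES"
    score_header = msg.get("X-Spam-Score")
    score = float(score_header) if score_header is not None else None
    return scanned, flagged, score


print(spam_status(raw))
```

A message with no X-Spam-Checker-Version header was most likely never passed through the filter, which is a different situation from one that was scanned and scored below the threshold.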
Scoring
How can I enhance scoring?
If your account has received an adequate volume of e-mail (200 spam + 200 non-spam messages), Bayesian filtering will automatically activate. As your account ages, Bayesian filtering will progressively become more accurate. Edit .spamassassin/user_prefs to increase the scores of the Bayesian rules for probabilities of 95% and above:
score BAYES_99 7
score BAYES_95 5
This should greatly enhance the server's ability to catch new spam, but only if you have an adequate number of learned messages.
It is also recommended you run sa-learn periodically on missed spam to retrain the filter.
See also: Streamlining SpamAssassin's Learning Process
Determining how many spams/hams are learned
sa-learn --dump magic will display the Bayesian metadata.
[msaladna@assmule ~]$ sa-learn --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0     140684          0  non-token data: nspam
0.000          0      33568          0  non-token data: nham
0.000          0     136013          0  non-token data: ntokens
0.000          0 1207054271          0  non-token data: oldest atime
0.000          0 1207402325          0  non-token data: newest atime
0.000          0 1207402432          0  non-token data: last journal sync atime
0.000          0 1207399817          0  non-token data: last expiry atime
0.000          0     345600          0  non-token data: last expire atime delta
0.000          0      15349          0  non-token data: last expire reduction count
nham: number of hams learned
nspam: number of spams learned
ntokens: number of tokens (words) within the database
In this example, the database has 140,684 spams and 33,568 hams, and the oldest entry dates from April 1, 2008 (Unix timestamp 1207054271).
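The counts and timestamps can also be extracted with a short script. The following Python sketch parses a few lines of the output shown above; parse_magic is a hypothetical helper, and the threshold of 200 is the default of bayes_min_spam_num and bayes_min_ham_num listed under Configuring Options.

```python
from datetime import datetime, timezone

# A subset of the sa-learn --dump magic output shown above.
sample = """\
0.000 0 3 0 non-token data: bayes db version
0.000 0 140684 0 non-token data: nspam
0.000 0 33568 0 non-token data: nham
0.000 0 136013 0 non-token data: ntokens
0.000 0 1207054271 0 non-token data: oldest atime
"""


def parse_magic(text):
    """Map each 'non-token data' label to its integer value (third column)."""
    values = {}
    for line in text.splitlines():
        fields = line.split(None, 4)
        if len(fields) == 5 and fields[4].startswith("non-token data:"):
            label = fields[4].split(":", 1)[1].strip()
            values[label] = int(fields[2])
    return values


stats = parse_magic(sample)

# Bayes activates once both learned counts reach the configured minimums.
bayes_active = stats["nspam"] >= 200 and stats["nham"] >= 200

# atime values are Unix timestamps (seconds since 1970-01-01 UTC).
oldest = datetime.fromtimestamp(stats["oldest atime"], tz=timezone.utc)

print(stats["nspam"], stats["nham"], bayes_active, oldest.date())
```

To use it against a live account, the text passed to parse_magic would come from the command's output instead of the embedded sample.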
