How GPU filters email spam

Our central email server machine, gpu.utcc, has a strong set of email filters in an attempt to reject as much spam as possible while not rejecting legitimate mail. This document is an attempt both to inform people what email filtering GPU is doing and to let people know about novel or just effective spam filtering techniques we're using.

Following this technical note requires a basic familiarity with SMTP, the protocol used to deliver email across the Internet. We've written up an introduction to SMTP if you're not already familiar with it.

All of the spam filtering here is described from the perspective of 'outside' email, email being sent by someone we would not allow to relay freely through us. We are more lenient (sometimes far more so) on sending machines we consider 'internal'.

Disclaimer: GPU's antispam filters are in constant evolution, and this document is practically guaranteed to not be complete and comprehensive at any given time. Further, you should not blindly copy any of these techniques for use on your own systems; antispam filtering, with its tradeoffs between the amount of spam rejected and the possibility of false positives, is something that every site needs to tune for local preferences.

Many if not all of our filtering rules have exceptions to them that we're not going to enumerate, because our basic rule of thumb is that a user's desire to receive legitimate email trumps pretty much everything. Our ultimate purpose is to serve our users; spam filtering is justified only so far, and only when they consider it a service instead of an obstacle.

SMTP-level rejection versus user-level filtering

One can deal with spam in two different places: one can attempt to reject it system-wide at the SMTP level, or one can provide tools and leave it to the users to filter their mailboxes or not as they desire. We feel that there are significant benefits to doing lots of antispam filtering at the SMTP level, and so we have chosen to deploy aggressive techniques there. There are practical benefits (less system load, less network utilization, less need to generate and send bounces), but they're lesser ones compared to a major win.

That major win is that mail rejected at the SMTP level is (very) visible to the senders. If filters mis-reject a message, the sender gets a delivery failure notification and knows that their email didn't get through, and possibly why. They can then use a number of means to deal with it, including getting in touch with us to get us to change the filters. Almost all user-level filtering silently swallows messages, which gives the sender no feedback that their message has not gotten through. Everyone has tales of messages that have been lost this way, often without either sender or receiver really knowing.

Some philosophy

On a theoretical basis, we want to reject spam and accept legitimate email. On a practical basis we have to use heuristics, and we must balance the precision of a rule against the amount of effort it is to maintain and the likelihood of false positives. Sometimes we pick general rules that are easy to implement and to explain but that are not strictly speaking 'spam rejection' rules.

The most prominent example, explained simply, is that we don't accept SMTP connections from dynamically assigned IP addresses; we require such people to route their email through their ISP's mail server. Dynamically assigned addresses and anonymous ISP customers in general have repeatedly been a source of problems, ranging from spammers buying expendable accounts to the recent scourge of open proxy based spam. However, we generally only notice dynamically assigned areas when they cause us problems. A number of specific filtering rules are expressions of this general decision.

Another general case is that we are often unkind to hosts that show signs of being badly administered. Our view tends to be that these hosts cause us problems even when they are not sending us spam; until one of our users actively wants to get email from them, we're going to encourage them to fix their problems.

Phases of filtering

The best way to discuss our mail filters is by covering each stage of SMTP mail delivery and discussing the filters that apply during it. There are some filters that extend across multiple SMTP stages, but most don't.

Initial SMTP connection

We maintain a filter list of bad hosts and network areas that can't talk to our SMTP port at all; their SMTP packets are silently discarded. Such blocks are the most unfriendly because the mail must time out on the sender's system, netting them cryptic error messages that imply network failures instead of friendly ones that admit to the real reason. Long-term blocks are applied to 'hostile' network areas and machines that have historically hammered rapidly on us in the face of other blocks. Generally we want such hostile areas to feel the pain of undeliverable mail piling up in their spool areas for days and days; maybe it will cause them to do something about the problem.

The filter list is reinitialized each time GPU reboots, currently once a week. During the week we add various spam sources and high volume sources of other rejections to the filters on a dynamic basis. If they cease being a problem or otherwise change their ways, they'll get another chance next week.

The SMTP daemon then checks the connection against a tcpwrappers control file. We use tcpwrappers rejection for several cases:

We have done enough experimentation and monitoring to know that we cannot insist on valid IP to hostname mappings in general unless we want to miss a lot of good email; a lot of perfectly good hosts have missing or bad data for this. It would also require a different implementation of the check, as the current one doesn't properly handle temporary DNS failures. For areas that we are one step short of blocking outright we consider this okay, but for general use it would not be.

Win: IP address greylisting

Greylisting is the generic name for techniques that delay email in order to determine that the sending machine properly implements SMTP mail retries. Greylisting is useful because many sources of spam do not implement such retries; some because they don't even notice errors, such as spam through open proxies, and some because the spammers have written their software to avoid that overhead.

GPU implements IP address greylisting. Our SMTP daemon maintains an internal list of IP addresses that it's seen connections from since it was last restarted (preloaded from a file to avoid email delivery delays for common valid SMTP senders). When a new IP address connects to us, it gets a 4xx error for the first half hour of its attempts; after that it's remembered and allowed through.

Implementing this caused an immediate drop in the amount of open proxy spam we received. As spam through open proxies is currently the largest single spam problem we see, this was a significant win.
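The greylisting behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code GPU runs (the real check lives inside the smtpserver): the class name IPGreylist, the allow method and the injectable clock are all invented here.

```python
import time

GREYLIST_DELAY = 30 * 60  # seconds: the "first half hour" in the text


class IPGreylist:
    """Remembers when each IP address was first seen (hypothetical helper)."""

    def __init__(self, preloaded=(), clock=time.time):
        # Preloaded addresses are common valid senders that skip the delay.
        self.known = set(preloaded)
        self.first_seen = {}
        self.clock = clock

    def allow(self, ip):
        """Return True to proceed, False if the caller should send a 4xx."""
        if ip in self.known:
            return True
        now = self.clock()
        started = self.first_seen.setdefault(ip, now)
        if now - started < GREYLIST_DELAY:
            return False
        # The sender kept retrying past the delay, so remember it.
        self.known.add(ip)
        del self.first_seen[ip]
        return True
```

Passing the clock in as a parameter is purely a convenience for trying the logic out without waiting half an hour.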

DNS Blocklist checks

GPU checks connections that have made it this far against a number of DNS blocklists, assuming they are not in a local whitelist (which ensures that if, for example, UTORmail ever appears in a DNSBL we will not reject email from it). For technical reasons, the rejection only shows up when we reply to the SMTP HELO, but we consider it a different phase of checking. Our current set of major DNSBLs is:

Of these, we can pretty much unreservedly recommend using the SBL and blackholes.easynet.nl, and at least cbl.abuseat.org to block open proxies. The first two are high-quality blocklists of spammer IP address ranges, and the third is generally considered the best open proxy DNSBL -- and you want to block open proxies these days.

To reduce load on DNSBLs we check them in a defined order and stop after the first hit. Our code supports checking a DNSBL without rejecting the message, and we log all hits, which gives us a convenient method for testing out a DNSBL: we put it in at the end of the list and see how many hits were rejected for other reasons or look like spam, versus how many seem like legitimate email. We strongly recommend this practice: you should always test the effects of DNS blocklists on your email before turning them live.
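As a rough illustration of ordered checking with a log-only trial mode, here is a small Python sketch. It is not our router code; the function names, the (zone, reject) pair layout and the zone names used when trying it out are assumptions made for the example. The query name follows the usual DNSBL convention of reversing the IPv4 octets and appending the zone.

```python
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name used to look an IPv4 address up in a DNSBL zone."""
    return ".".join(reversed(ip.split("."))) + "." + zone


def is_listed(ip, zone):
    """True if the zone has an A record for the address.

    Any lookup failure (including a temporary one) counts as 'not listed'.
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except OSError:
        return False


def check_dnsbls(ip, zones, lookup=is_listed, log=print):
    """Check zones in order and stop at the first hit.

    zones is a list of (zone, reject) pairs; a hit in a zone with
    reject=False is logged but does not reject.  Returns the zone that
    caused a rejection, or None.
    """
    for zone, reject in zones:
        if lookup(ip, zone):
            mode = "reject" if reject else "log only"
            log("DNSBL hit: %s in %s (%s)" % (ip, zone, mode))
            return zone if reject else None
    return None
```

A list under trial would be appended as (zone, False), so its hits show up in the logs without affecting delivery.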

HELO

We require HELO names to pass a series of tests:

Checking HELO names is contentious, but appears to be justified by the ESMTP RFC. It defeats a certain number of spammers (who seem to forge legitimate HELO names far less often than they forge legitimate domains in the envelope sender, possibly because far fewer people check), but it also rejects a lot of machines that we consider badly administered. Under our philosophy, so far we consider this acceptable.

Common MAIL FROM and RCPT TO checks

The address given for both MAIL FROM (if non-null) and RCPT TO must pass certain checks:

This check is uncontroversial, as any number of large organizations (including AOL) insist on this, but correspondingly not all that useful any more. It is still worth performing on a technical correctness front, because it ensures that your system has at least a plausible chance to deliver to an address that you're being handed (whether for delivery of the message, or for delivery of any bounces it may provoke).

Further, these days you really want to verify that local usernames are valid before you accept the email. There are simply too many spammers forging too many random usernames, either to try to deliver to them or as the envelope sender on their spam, to blindly accept all local usernames. (Admittedly we may be biased because GPU has been used by spammers for forged envelope senders in high volume for going on a year now.)

MAIL FROM

We perform additional checks for MAIL FROM:

There are two more MAIL FROM checks that need to be described in more detail.

We check that the domain of the envelope sender doesn't have NS records whose hostnames or IP addresses would be rejected by our SMTP tcpwrappers control file. This is the first instance of checks aimed at a general rule: check for things the spammers can't change easily. It is easy for spammers to register a lot of domain names, but it is harder for spammers to find spam-friendly domain name service. Often spammers reuse either the same spam-friendly DNS services, or site their DNS servers under revolving names out of known spam zones of the network. By checking for DNS service instead of just the domain name, we take aim at something that it's harder for spammers to change.
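The decision part of the nameserver check might look something like the following Python sketch. The real check is an external C program driven by our tcpwrappers control file; here the blocked areas are simply passed in as a list of domain name tails and a list of networks, which is an assumption made to keep the example self-contained. The DNS lookups that produce the nameserver list are left to the caller.

```python
import ipaddress


def nameservers_rejected(nameservers, bad_name_tails, bad_networks):
    """Return True if any nameserver falls in a blocked area.

    nameservers: iterable of (hostname, [ip strings]) for the envelope
    sender's domain.  bad_name_tails and bad_networks stand in for the
    rules in the control file (hypothetical representation).
    """
    nets = [ipaddress.ip_network(n) for n in bad_networks]
    for host, addrs in nameservers:
        host = host.lower().rstrip(".")
        if any(host == t or host.endswith("." + t) for t in bad_name_tails):
            return True
        for addr in addrs:
            ip = ipaddress.ip_address(addr)
            if any(ip in net for net in nets):
                return True
    return False
```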

The badsrc check is difficult to describe concisely, but it works like this. First, we take a list of domain name tails, such as example.co.xx and co.yy, and see if the HELO name ends in any of them. If it does, we require the envelope sender domain to also have the same domain name tail; if it does not, it is rejected with a message about how we do not accept that envelope sender address from that host. For example, if we saw HELO sample.example.com and example.com is in the list of domain name tails, we will accept envelope senders of user@sample2.example.com and user@example.com, but not user@example2.com. (The check is named after the variable used to keep a list of the domain name tails.)

The badsrc check is primarily for known or suspected open SMTP relays that we feel that we cannot block outright, especially if they are multistage relays (a customer is open, and relays their mail outbound through their ISP). We hope that allowing direct ISP email addresses to get through will avoid the problems that outright blocking causes.
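Since the badsrc rule is easier to show than to describe, here is a minimal Python sketch of it. It is an illustration, not the router script itself. Two details are assumptions: tails are matched on whole domain labels, and when a HELO name matches several tails the sender only needs to share one of them.

```python
def has_tail(domain, tail):
    """True if domain is tail itself or a subdomain of it."""
    domain = domain.lower().rstrip(".")
    tail = tail.lower()
    return domain == tail or domain.endswith("." + tail)


def badsrc_allows(helo_name, sender_domain, tails):
    """Return False when the HELO name has a listed tail the sender lacks."""
    matched = [t for t in tails if has_tail(helo_name, t)]
    if not matched:
        return True  # the HELO name is not covered by the list at all
    return any(has_tail(sender_domain, t) for t in matched)
```

With example.com in the list and a HELO of sample.example.com, this accepts sender domains sample2.example.com and example.com and refuses example2.com, matching the example in the text.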

RCPT TO

At RCPT TO time we apply the common checks described above, check for attempts to use us as a relay, and then look for spamtrap addresses we call flushot addresses.

Like many University of Toronto groups that have been around for a long time, we have accumulated a large number of no longer valid email addresses that have appeared on the Internet for long enough that they are on commonly used spammer address lists. When we see an email message to one of these addresses, we add the envelope sender address (previously stashed away temporarily for just this purpose) to a flushot file, along with a note about the circumstances.

In addition to 4xx stalls during subsequent MAIL FROMs, we also stall additional RCPT TOs, in case the spammer is trying to send this message to additional addresses. We could also give a 4xx during the DATA phase, in case the spammer had already given some valid envelope recipients before he hit a spamtrap, but that's not currently implemented.
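The flushot bookkeeping can be sketched per SMTP connection as follows. This is a simplified Python illustration with invented names (FlushotSession, record); the specific reply codes 250, 450 and 550 are assumptions, since the text above only says that stalls are 4xx responses, and the reply given to the spamtrap address itself is a guess.

```python
class FlushotSession:
    """Spamtrap state for one SMTP connection (hypothetical sketch)."""

    def __init__(self, spamtraps, flushot_senders, record):
        self.spamtraps = spamtraps        # set of trap addresses
        self.flushot = flushot_senders    # shared set of senders that hit a trap
        self.record = record              # callable(sender, note), e.g. append to a file
        self.sender = None
        self.tripped = False

    def mail_from(self, sender):
        sender = sender.lower()
        if sender in self.flushot:
            return 450                    # stall senders that hit a trap before
        self.sender = sender
        return 250

    def rcpt_to(self, rcpt):
        if self.tripped:
            return 450                    # stall further recipients of this message
        if rcpt.lower() in self.spamtraps:
            self.tripped = True
            if self.sender:               # never record the null sender
                self.flushot.add(self.sender)
                self.record(self.sender, "spamtrap hit on " + rcpt)
            return 550                    # assumed reply for the trap address
        return 250
```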

Spamtraps are useful because spammers often hit them early in spam runs, before they get many (or any) of your real users. We see this happening all the time. In addition, the archive of flushot hits (with the associated information) can be mined for the number of hits you see from various address patterns, or various source hosts or domains.

DATA

Finally, we subject the actual email message itself to various checks (this is sometimes called content checking, as distinct from envelope checks and source address checks). These checks are done when we've received the final . that terminates the email message. Locally, we tend to call this set of checks body checks, although they also have both the envelope headers and the message headers available.

Body checks have the potential to be very dangerous; the classical case is people writing to discuss or complain about spam. For this reason, body checks are not done on email forwarded to us by UTORmail, and some body checks are not done for email to certain administrative addresses. Because of the danger potential and our desire to closely monitor body checks, we ensure that they log detailed information for rejections and warnings to the system logs.

Our body checks are the most complex code we have, because we understand and decode the structure of the mail message. We parse complex MIME messages, we decode encoded MIME body parts as required, and we even understand HTML. This means that obfuscation techniques that spammers use to hide their messages from simple text-based scanning software don't work on us; if our body checker can't read it, the user's mail reader probably can't either.

While some obfuscation techniques continue to work, there's a reason we don't care: the presence of obfuscation means the message has something to hide, which is a giveaway sign of spam. It's often easier to recognize obfuscation than look for the thing being obfuscated.

The body checks look for the following things:

Because we understand the message structure and decode it properly, we can make our checks quite safe. For example, we only look for bad HTML encodings in messages or message-parts marked as HTML, so that a message discussing the technical details of such spam would pass through.
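To illustrate why decoding the structure first makes checks safer, here is a small Python sketch using the standard email package. It is not our body checker: the ENTITY_RUN pattern is an invented example of a 'bad HTML encoding', and the function names are hypothetical. The point is only that the pattern is applied to the decoded text of parts marked text/html and to nothing else.

```python
import email
import email.policy
import re


def decoded_text_parts(raw_bytes):
    """Yield (content_type, text) for each text part, transfer-decoded."""
    msg = email.message_from_bytes(raw_bytes, policy=email.policy.default)
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        try:
            text = part.get_content()
        except (LookupError, ValueError):
            # Unknown charset or similar: fall back to a lossless decode.
            text = part.get_payload(decode=True).decode("latin-1")
        yield part.get_content_type(), text


# Invented example pattern: a run of numeric character references,
# one way of hiding words in HTML.  Not taken from the real checker.
ENTITY_RUN = re.compile(r"(?:&#\d{2,3};){4,}")


def has_bad_html_encoding(raw_bytes):
    """True only if an HTML part contains the pattern after decoding."""
    return any(
        ctype == "text/html" and ENTITY_RUN.search(text)
        for ctype, text in decoded_text_parts(raw_bytes)
    )
```

A plain-text message that quotes the same character sequence while discussing spam is not flagged, because its part is not marked as HTML.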

Win: Checking website hosting

In addition we perform one final check: we look up the IP addresses of every distinct web site mentioned in URLs in the message and check to see if they're in the SBL. If an IP address is, and if the IP address is also in a list of bad network areas we call the danger zone, we reject the message (if the IP address is SBL-listed but not in the danger zone, we make a note in our logs).

This is the second instance of a check aimed at the general rule: check for things the spammers can't change easily, like the nameserver check mentioned earlier. There are not very many places on the Internet that will give connectivity to spammer web sites; as spammers must have reliable web site hosting in order to get benefit from their spam, they rapidly gravitate to the few such places. When they do, and the places are SBL listed, we can suddenly easily reject them. And we do; we see a frequent stream of rejections, often from new spammers or spam domains we'd never heard of before.

We require both SBL listing and presence in our danger zone because this gives us more control over rejections. Given the potential dangers of body checks, we feel much safer in essentially defaulting to accepting email unless we're sure (by adding a network area to the danger zone) that we want to reject it. Unfortunately, there are also some large ISPs that continue to host spammer web sites and thus get addresses SBL listed, yet host real web sites we can't refuse; the largest current case is Yahoo (as of 2003-11-07).
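The two-condition rule (SBL-listed and in the danger zone) can be sketched like this in Python. It is a simplified illustration rather than our actual check: the URL pattern is deliberately crude, and the resolve and in_sbl callables, which would do the DNS and SBL lookups, are assumed to be supplied by the caller.

```python
import ipaddress
import re
from urllib.parse import urlsplit

URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def url_hosts(text):
    """Return the distinct host names mentioned in http(s) URLs."""
    hosts = set()
    for url in URL_RE.findall(text):
        try:
            host = urlsplit(url).hostname
        except ValueError:
            continue  # malformed URL; ignore it in this sketch
        if host:
            hosts.add(host.lower())
    return hosts


def check_web_hosts(text, resolve, in_sbl, danger_zone, log=print):
    """Return True to reject the message.

    resolve(host) -> list of IP strings and in_sbl(ip) -> bool are
    caller-supplied (assumptions); danger_zone is a list of networks.
    """
    zone = [ipaddress.ip_network(n) for n in danger_zone]
    for host in sorted(url_hosts(text)):
        for ip in resolve(host):
            if not in_sbl(ip):
                continue
            if any(ipaddress.ip_address(ip) in net for net in zone):
                return True
            log("SBL-listed web host outside danger zone: %s (%s)" % (host, ip))
    return False
```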

Overheads in filtering

Filtering incoming SMTP email this aggressively does have some effect on incoming email and the system. However, we experience almost no system load on a mail volume of about 2,000 messages a day (on what is now relatively slow hardware). Most of the effects are delays on incoming messages due to the need to do a lot of DNS lookups. We might experience more CPU overhead with more mail volume, or if we received a lot of large messages that required complex decoding in our body checks.

Our overall view is that we do not mind if email messages take a bit longer to get through if it results in less spam (especially significantly less). We would care if there were significant delays for many messages, but we've found no indication that there are any; a delay of thirty to sixty seconds extra in a SMTP client transferring a message to our machine can be totally dwarfed by backed up queues on the SMTP client itself, perhaps because of a volume of unfiltered spam.

Overall effectiveness (A conclusion of sorts)

Partly because of our large number of filters and the variety in where SMTP sending attempts are rejected, it's hard to tell how many spam messages we reject on an ongoing basis. The primary author mostly relies on how much spam he gets or doesn't get in his own mailbox, with anecdotal reports from other people he talks to. On this basis, our spam filters are a shining success.

We receive essentially zero reports of mis-rejected email from our users. Unfortunately this may be because they either don't know about it or don't know to complain to us.

The Implementation Details

This section is for people interested in the technical details of how our antispam filtering is implemented, as opposed to what it looks for and what it filters.

A brief ZMailer overview

GPU runs a hacked version of ZMailer 2.2 as its mail server (the MTA, Mail Transport Agent). ZMailer splits the job of handling email up and distributes it between a number of different programs.

The router makes all the decisions about where email should go (including whether destination addresses are good at all). A router process is running all the time, scanning a work area for new messages to process. Structurally, the router is one half handler for basic operations (scanning for new messages, reading and writing control files, etc), and one half a script interpreter for a language modeled on the Unix shell with extensions for dealing with Lisp-like lists, making database and DNS lookups, and analyzing RFC 822 email addresses. Routing and address rewriting decisions are made by a script that the router program loads up; for each message, the router calls a series of functions that the script is expected to define. This puts a powerful programming language (plus the ability to run external programs) at the service of making routing decisions, while letting it deal with them at a high level.

The scheduler schedules the delivery of routed messages to destinations, which may be remote SMTP servers, local mailboxes (or files, or user-run filtering processes, or the like), and even pseudo-destinations that handle things like 'wait until the DNS lets you resolve this name'. The actual delivery is done by programs that the scheduler starts, called transport agents. The scheduler handles timing out messages that have sat around for too long without being delivered.

The smtpserver receives messages via SMTP and queues them up for the router to process. It operates as a normal Unix network server (like apache): the main smtpserver process sits there waiting for connections, and when it receives one it forks itself so the child handles the new client. Each smtpserver child process handles only one SMTP connection; when the connection ends, the child exits. The smtpserver is just one of a number of message submission programs; for example, another one is a replacement sendmail program for submitting messages on the local machine.

Complicating the smtpserver

This is a clean and simple design with the functionality logically split up, but it suffers from a small problem. SMTP defines a few commands that allow a SMTP client to inquire whether an address is valid and what it expands to (the VRFY and EXPN commands). While the smtpserver needs to be able to support them, the router is the only thing that knows anything about interpreting and expanding addresses.

ZMailer takes the simple way out: to support these commands, the smtpserver starts a new copy of the router using a special interactive mode normally used for debugging. It then invokes a special function that the router scripts are expected to define with various options; the function is expected to print out properly formed response lines which the smtpserver will then pass back to the SMTP client. This has a certain amount of overhead but keeps life mostly simple. For efficiency, the smtpserver only starts the subordinate router when it needs to (when there's a SMTP command that needs the router's involvement), and it keeps the router running until the smtpserver process itself exits.

Most of our complicated mail filtering is implemented by extensions to this mechanism: at every interesting event the smtpserver calls out to the subordinate router. The router's response lines are scrutinized to determine whether the operation was successful or not so the smtpserver can update its internal state (it does not allow RCPT TO before a successful MAIL FROM, for example). The current 'interesting events' are:

What implements each check

IP-level blocks are done through Linux's iptables kernel support. Unix shell scripts help maintain and dynamically update the blocks.

IP address greylisting and tcpwrappers are done directly in the smtpserver, because both work better there. (IP address greylisting has to be tracked by the central smtpserver process, because only it has persistent state across multiple connections.)

The smtpserver directly rejects known bad HELO name patterns and HELO names with bad characters (the latter check is hardcoded), without starting a subordinate router process.

Checking DNS blocklists, all other HELO processing, all RCPT TO, and all MAIL FROM processing except the rejection of domains with nameservers we don't like is handled by just the router. Nameservice checking is delegated to an external program written in C. Doing DNSBLs in the router has enabled very flexible handling (and logging). Anti-relay checking is also entirely in the router.

Checking message bodies at DATA time is delegated by the router scripts to an external program written in Python. The smtpserver-spawned router instance invokes the program, telling it the file that's about to be queued to the real router process. If the program exits with a successful status and has produced output, the output is passed to the SMTP client; otherwise, the router scripts report a 2xx code to accept the email. The smtpserver watches the result code being sent to the SMTP client, and either queues the mail (if it is a 2xx response) or removes the pending mail file (if it is a 4xx or a 5xx response).
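The hand-off convention at DATA time is small enough to sketch directly. The Python below is only a model of the decision logic as described; the real pieces are ZMailer router scripts and the C smtpserver, and the function names and the wording of the accept line are invented for the example.

```python
def router_reply(exit_status, output, accept_line="250 Message accepted"):
    """Pick the response line the router scripts hand back.

    The checker's output is used only when it exited successfully and
    actually printed something; in every other case the mail is accepted.
    """
    if exit_status == 0 and output.strip():
        return output.strip()
    return accept_line


def smtpserver_action(reply):
    """Decide what happens to the pending mail file for a reply line."""
    first = reply[:1]
    if first == "2":
        return "queue"
    if first in ("4", "5"):
        return "remove"
    raise ValueError("unexpected SMTP reply: %r" % reply)
```

One property of this convention is that a body checker that fails to run, or exits with an error, results in the message being accepted rather than rejected.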

The ZMailer router message file format bundles envelope control information in special magic lines at the start of the file before the regular message headers and body. The smtpserver writes email to be sent to the router in this format, so the body checker (which is examining the about-to-be-sent file) has available to it not only the message headers but the envelope headers as well.

The structure of body checks

The body checking program is split into two separate chunks of code (plus a small wrapper): the mail parser and the filterer. The former handles message management; the latter makes the actual decisions.

The mail parser's main job is to turn the file argument into a message object, which holds the message's contents and information and supports a series of queries (such as 'what is the message's MIME structure') and operations (such as 'tell me if this regular expression occurs in the decoded text body'). The parser is only feasible because it gets to reuse existing Python modules for much of the hard work; HTML parsing, and MIME parsing and decoding, are all handled by existing, standard Python modules.

The mail filterer is handed a message object which it applies a series of tests to, signaling either rejection or acceptance during the process. It uses the functionality exposed by the message object (and some support routines exposed by the mail parser) to do this, plus whatever other Python functions it wants to invoke.

The separation of functionality keeps the mail filtering checks easy to follow, with a simple linear structure mostly composed of calls back to message object functions. In fact a good number of the lines in the mail filtering module are data: bad web sites, regular expressions we're looking for, our danger zone of IP address ranges, and so on.
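A toy version of the parser and filterer split might look like the following Python sketch. Everything in it is invented for illustration (the Message interface, the Reject exception, and the data tables); it is only meant to show the shape: a message object that answers questions, and a filterer that is mostly data plus a linear run of checks.

```python
import re


class Reject(Exception):
    """Raised by a check to signal that the message should be refused."""


class Message:
    """Stand-in for the parser's message object (hypothetical interface)."""

    def __init__(self, headers, text):
        self.headers = headers    # dict of header name -> value
        self.text = text          # body text, already decoded by the parser

    def body_matches(self, pattern):
        return re.search(pattern, self.text, re.IGNORECASE) is not None


# Most of a filterer like this is data; these entries are made up.
BAD_PHRASES = [r"act now to claim", r"100% risk[- ]free"]
BAD_SITES = ["pills.example"]


def filter_message(msg):
    """Apply each check in turn; returning normally means 'accept'."""
    for pattern in BAD_PHRASES:
        if msg.body_matches(pattern):
            raise Reject("body matches %r" % pattern)
    for site in BAD_SITES:
        if msg.body_matches(r"https?://(?:[\w-]+\.)*" + re.escape(site)):
            raise Reject("mentions %s" % site)
```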

Important Win: Using Python

Python has proven an excellent language to write a complex and relatively demanding program in. Writing such a capable message body checker would not have been possible without the existence of a lot of support modules (everything from MIME handling to HTML parsing to BASE64 decoding). Without security boosts provided by its automatic handling of memory and the strong exception model, it might not have been possible to confidently deploy the program in a hostile environment. (Incoming email has to be assumed to be a hostile place.)

Python's good support for modularity and objects has made it simple to divide the program up in a structure that increases maintainability and clarity. At the same time, its flexibility and power have made it easy to write even relatively complex filtering rules, such as the check for web site hosts being in the SBL.

While the body checker is a lot of code (almost 1300 lines of code, comments, data, and two support modules, not counting the Python distribution-supplied modules for MIME, HTML, etc), it still feels and reads to the author like a small, simple, easy to follow program. It was written rapidly and easily, pretty much without problems or surprises, and evolves smoothly.

Important Win: Flexible spam filtering systems

One strong lesson we've drawn from the evolution of our spam filtering is that powerful filtering requires flexibility. You cannot get powerful filtering by hard coding various checks in something that is difficult to change, because both spam and your needs keep changing.

Because our spam filtering is coded almost entirely in easily written high-level languages, over and over we have been able to adopt and alter filters easily. Because our smtpserver uses the simple and generic filtering interface it does (as inefficient as it is), it has been easy to implement very sophisticated spam filters and operations. Because it has been easy to implement powerful filters, they have been.

One consequence is that it has been easy to experiment with checks that may not work out (and it can be hard to predict in advance what checks are going to work out). Because writing a new check is easy and fast, we can afford to write them when we aren't sure that they will pay off. If they work, great; if they don't work, we have not wasted much time and effort. The best recent example is the 'web host IP in SBL' check, which has been far more beneficial than the author expected when he came up with the idea.

Code availability

Because we run a hacked version of an ancient mailer, we expect that our code is not of much interest to anyone else. If we're wrong, get in touch with us.

This document is being written (probably continuously) by Chris Siebenmann