By Gordon V. Cormack, David R. Cheriton School of Computer Science, University of Waterloo, Canada, gvcormac@uwaterloo.ca
Spam is information crafted to be delivered to a large number of recipients, in spite of their wishes. A spam filter is an automated tool to recognize spam so as to prevent its delivery. The purposes of spam and spam filters are diametrically opposed: spam is effective if it evades filters, while a filter is effective if it recognizes spam. The circular nature of these definitions, along with their appeal to the intent of sender and recipient, makes them difficult to formalize. A typical email user has a working definition no more formal than "I know it when I see it." Yet current spam filters are remarkably effective: more effective than might be expected given the level of uncertainty and debate over a formal definition of spam, and more effective than might be expected given the state-of-the-art information retrieval and machine learning methods for seemingly similar problems. But are they effective enough? Which are better? How might they be improved? Will their effectiveness be compromised by more cleverly crafted spam?
We survey current and proposed spam filtering techniques with particular emphasis on how well they work. Our primary focus is spam filtering in email; similarities and differences with spam filtering in other communication and storage media, such as instant messaging and the Web, are addressed peripherally. In doing so we examine the definition of spam, the user's information requirements, and the role of the spam filter as one component of a large and complex information universe. Well-known methods are detailed sufficiently to make the exposition self-contained; however, the focus is on considerations unique to spam. Comparisons, wherever possible, use common evaluation measures and control for differences in experimental setup. Such comparisons are not easy, as benchmarks, measures, and methods for evaluating spam filters are still evolving. We survey these efforts, their results, and their limitations. In spite of recent advances in evaluation methodology, many uncertainties (including widely held but unsubstantiated beliefs) remain as to the effectiveness of spam filtering techniques and as to the validity of spam filter evaluation methods. We outline several uncertainties and propose experimental methods to address them.
Email Spam Filtering: A Systematic Review surveys current and proposed spam filtering techniques with particular emphasis on how well they work. The primary focus is on spam filtering in email, while similarities and differences with spam filtering in other communication and storage media, such as instant messaging and the Web, are addressed peripherally. The review examines the definition of spam, the user's information requirements, and the role of the spam filter as one component of a large and complex information universe. Well-known methods are detailed sufficiently to make the exposition self-contained; however, the focus is on considerations unique to spam. Comparisons, wherever possible, use common evaluation measures and control for differences in experimental setup. Such comparisons are not easy, as benchmarks, measures, and methods for evaluating spam filters are still evolving. The author surveys these efforts, their results, and their limitations. In spite of recent advances in evaluation methodology, many uncertainties (including widely held but unsubstantiated beliefs) remain as to the effectiveness of spam filtering techniques and as to the validity of spam filter evaluation methods. The review outlines several uncertainties and proposes experimental methods to address them. Email Spam Filtering: A Systematic Review is a highly recommended read for anyone conducting research in the area or charged with controlling spam in a corporate environment.