Here are the ten broad steps to build a spam catcher:
- Get a sample of emails that are known to be spam or not spam. Split the sample 60:20:20 to provide a “training” set, a “cross-validation” set and a “test” set.
- Turn each email into a list of words by:
  - Stripping out headers (unless they are part of the spam test) and other redundant content
  - Running NLP software to record only the stem of each word (for example, record city and cities as cit)
- Count the number of times each unique word appears in the sample and order the list by frequency so that we can use the top 100 or 10 000 or 50 000 (whatever) words as our dictionary to check for spam. Remember to use stemmed words!
- Convert the list of words in each email into a list of look-up numbers by substituting the row number of the word from the dictionary we made in Step 3.
- For each email, make another list where row 1 is 1 if the first word in the dictionary is present in the email, row 2 is 1 if the second word in the dictionary is present in the email, and so on. If a word is not present, leave the value for that row as zero. You should now have as many lists as you have emails, each with as many rows as you have words in your spam dictionary.
- Run an SVM algorithm on the training set to predict whether each email is spam (1) or not spam (0). The input is the list of 1s and 0s indicating which words are present in the email.
- Compare the predictions with the known values and compute the percentage correct.
- Compute the predictions on the cross-validation set and tweak the algorithm depending on whether the cross-validation accuracy is too similar to the training accuracy (suggesting the model is underfitting and could be stronger) or too dissimilar (suggesting the model is overfitting, that is, too strong).
- Find the words most associated with spam.
- Repeat as required.
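
The steps above can be sketched in a few dozen lines of Python. This is only a minimal illustration under several assumptions that are not in the steps themselves: the six training emails and two validation emails are made up, `crude_stem` is a toy suffix-stripper standing in for real NLP software, the dictionary is built from the training emails only, and the SVM is a hand-rolled linear one (sub-gradient descent on the hinge loss, no bias term) rather than a library implementation. All function names and parameter values (`epochs`, `lr`, `lam`) are hypothetical choices for the sketch.

```python
import re
from collections import Counter

def crude_stem(word):
    """Toy stand-in for a real stemmer: strip one common suffix."""
    for suffix in ("ies", "ing", "es", "s", "y"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(email):
    """Step 2: lowercase the text, keep alphabetic words, stem each one."""
    return [crude_stem(w) for w in re.findall(r"[a-z]+", email.lower())]

def build_dictionary(token_lists, size):
    """Step 3: map the `size` most frequent stems to row numbers (0-based)."""
    counts = Counter(t for tokens in token_lists for t in tokens)
    return {word: row for row, (word, _) in enumerate(counts.most_common(size))}

def to_features(tokens, dictionary):
    """Steps 4-5: look up row numbers, then set those rows to 1."""
    rows = [dictionary[t] for t in tokens if t in dictionary]
    features = [0] * len(dictionary)
    for row in rows:
        features[row] = 1
    return features

def train_linear_svm(X, y, epochs=200, lr=0.1, lam=0.01):
    """Step 6: sub-gradient descent on the L2-regularised hinge loss."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for x, label in zip(X, y):
            t = 1 if label == 1 else -1  # SVM targets are +1 / -1
            margin = t * sum(wj * xj for wj, xj in zip(w, x))
            for j, xj in enumerate(x):
                # Only examples inside the margin push the weights.
                push = t * xj if margin < 1 else 0.0
                w[j] -= lr * (lam * w[j] - push)
    return w

def predict(w, x):
    """Return 1 (spam) if the weighted sum is positive, else 0."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) > 0 else 0

def accuracy(w, X, y):
    """Steps 7-8: fraction of predictions that match the known labels."""
    return sum(predict(w, x) == label for x, label in zip(X, y)) / len(y)

# Step 1: made-up labelled emails (1 = spam, 0 = not spam).
train_emails = [
    ("Win free money now", 1),
    ("Free prize, claim your money", 1),
    ("Cheap pills, win a prize", 1),
    ("Meeting notes for the city council", 0),
    ("Lunch in the city tomorrow?", 0),
    ("Project notes and meeting agenda", 0),
]
val_emails = [
    ("Claim your free prize now", 1),
    ("Notes from the council meeting", 0),
]

dictionary = build_dictionary([tokenize(text) for text, _ in train_emails], size=50)
X_train = [to_features(tokenize(text), dictionary) for text, _ in train_emails]
y_train = [label for _, label in train_emails]
X_val = [to_features(tokenize(text), dictionary) for text, _ in val_emails]
y_val = [label for _, label in val_emails]

w = train_linear_svm(X_train, y_train)
train_acc = accuracy(w, X_train, y_train)
val_acc = accuracy(w, X_val, y_val)

# Step 9: the dictionary words with the largest positive weights.
top_spam_words = sorted(dictionary, key=lambda word: w[dictionary[word]], reverse=True)[:3]

print("training accuracy:", train_acc)
print("cross-validation accuracy:", val_acc)
print("words most associated with spam:", top_spam_words)
```

Because the toy spam and non-spam emails share no words, the data is trivially separable and tells us nothing about real-world performance. On a real sample you would split 60:20:20 as in Step 1, swap in a proper stemmer and a library SVM, and use the gap between the two accuracies to decide how to tune the model.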