Detecting Phish Using Website Content and URL N-Gram Features






Phishing websites attempt to steal login credentials or other confidential information from Internet users. They are ubiquitous, and their impact and prevalence only increase with time.

Many approaches have been attempted to counteract this threat. This thesis addresses the detection of phishing websites in two main ways. First, it continues the work of a previous heuristic-based approach by converting many of its proposed heuristics and filtering methods into features for a machine learning classifier, and by improving its website collection method to reduce the inherent bias in its legitimate URL set. Second, it adds novel features: how often each URL N-gram occurs in each URL, the Shannon entropy of the URL, and N-gram-based website similarity metrics.
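As a rough illustration of the two URL-level features named above, the following minimal sketch counts character N-grams in a URL and computes the Shannon entropy of its characters. The function names, the choice of character-level (rather than token-level) N-grams, and the default N of 2 are assumptions made for this example, not details taken from the thesis.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return every contiguous character N-gram of length n in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_counts(url, n=2):
    """Count how often each character N-gram occurs in a single URL.

    Assumption: N-grams are taken over raw characters; the thesis may
    tokenize or normalize URLs differently.
    """
    return Counter(char_ngrams(url, n))


def shannon_entropy(text):
    """Shannon entropy, in bits per character, of the character distribution."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Each URL would then contribute one count per N-gram plus a single entropy value to the feature vector passed to the classifier.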

The URL N-gram and Shannon entropy features differentiate legitimate and phishing URLs by taking advantage of fundamental differences between the two. The N-gram-based website similarity features take advantage of the fact that phishing websites are often copies of popular legitimate websites. Because these features measure the degree to which two websites are similar, they also serve as an indirect proxy for the contents of the webpage. These methods yield three particularly interesting classification results. The first achieves 98.9% accuracy with an F1-score of 98.6%. The second achieves 98.4% accuracy but is trained in 1 minute and 14 seconds (as opposed to 22 hours). The third achieves a 100% true negative rate, which makes it well suited for use as a preprocessing filter.
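The abstract does not specify how the N-gram-based similarity between websites is computed. One simple possibility, shown here only as a hedged sketch, is the Jaccard similarity between the sets of character N-grams of two pages. The function names, the use of Jaccard similarity, and the default N of 3 are all assumptions for illustration.

```python
def ngram_set(text, n=3):
    """Return the set of distinct character N-grams of length n in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard_similarity(page_a, page_b, n=3):
    """Jaccard similarity of two pages' N-gram sets, in the range [0, 1].

    Assumption: this metric stands in for the thesis's similarity
    features, whose exact definition is not given in the abstract.
    """
    grams_a = ngram_set(page_a, n)
    grams_b = ngram_set(page_b, n)
    union = grams_a | grams_b
    if not union:
        # Both inputs are shorter than n characters: nothing to compare.
        return 0.0
    return len(grams_a & grams_b) / len(union)
```

A score near 1 against a known legitimate page would suggest a near copy, which is the behavior the similarity features are meant to capture.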



Phish, URL, Lexical, Webpage Content, Machine learning, N-gram