Automatic Feature Extraction for Phishing Email Detection






Each year, phishing emails cause billions in damages, and researchers spend countless hours developing new detection techniques and finding the flaws in old ones. The number of articles publishing these findings is increasing too rapidly for humans to assimilate and remember in a reasonable amount of time. This thesis adapts FeatureSmith, an automatic feature extraction system for Android malware detection, to phishing email detection, automatically extracting all the features in each scholarly article, patent, and thesis. Because accurate classification of a phishing email requires the intelligent application of multiple features, the weighting and ranking that FeatureSmith uses to find the best features for Android malware was not as effective for phishing email. As a result, the final, most helpful features must be manually extracted from the automatically generated explanations for use in phishing email detection. Sometimes this extraction involves going to the source article, which can reveal tables or other sources of overlooked features that can also be implemented. In total, 75 final features, both binary and discrete, were manually extracted. Implementing these features with machine learning, aided by intuition, for phishing email classification resulted in 94.6% detection accuracy on an unbalanced dataset with separate training and testing emails obtained from the Anti Phishing Shared Pilot at the International Workshop on Security and Privacy Analytics. The top reported testing accuracy for this dataset is 96.8%.



Phishing detection, Phishing, Automatic feature extraction