Trend Analysis on Phishing Email Data Using Natural Language Processing


Phishing email attacks cause growing financial losses to organizations, and their frequency has recently been observed to increase. This thesis attempts to find the patterns inherent to phishing attacks using natural language processing methodologies. Separating phishing emails from legitimate ones is challenging because the two sets of emails are highly similar in syntax. Fortunately, many natural language processing methodologies have shown promising results in categorizing text based on its grammar. In this context, this thesis adapts three measures, cosine similarity, the Jaccard coefficient, and Euclidean distance, to cluster the emails. It also attempts to analyze the trend in the vocabulary and grammatical syntax of phishing attack emails. Moreover, using software provided by The Stanford Natural Language Processing Group, we attempt to find the most frequent linguistic patterns in phishing attacks, which can be used to further separate such emails from legitimate ones.
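
For illustration only, the three measures named above can be sketched on simple bag-of-words representations of two emails. This is a minimal sketch under stated assumptions, not the implementation used in the thesis: the whitespace tokenizer, the raw term-count vectors, and the function names (tokenize, cosine_similarity, jaccard_coefficient, euclidean_distance) are all hypothetical choices made for this example, and the thesis may weight or preprocess terms differently.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive, assumed tokenizer)."""
    return text.lower().split()


def cosine_similarity(a, b):
    """Cosine of the angle between the term-count vectors of two token lists."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as sharing nothing
    return dot / (norm_a * norm_b)


def jaccard_coefficient(a, b):
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    sa, sb = set(a), set(b)
    union = sa | sb
    if not union:
        return 0.0
    return len(sa & sb) / len(union)


def euclidean_distance(a, b):
    """Straight-line distance between term-count vectors (0 means identical counts)."""
    ca, cb = Counter(a), Counter(b)
    return math.sqrt(sum((ca[t] - cb[t]) ** 2 for t in ca.keys() | cb.keys()))


if __name__ == "__main__":
    # Two made-up subject lines, used only to exercise the functions.
    email_a = tokenize("verify your account now")
    email_b = tokenize("verify your password now")
    print(cosine_similarity(email_a, email_b))
    print(jaccard_coefficient(email_a, email_b))
    print(euclidean_distance(email_a, email_b))
```

Note that cosine similarity and the Jaccard coefficient grow as two emails become more alike, whereas Euclidean distance shrinks, so a clustering procedure has to treat the third measure as a dissimilarity.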



Natural Language Processing, Document similarity, Phishing detection, Phishing, Document clustering