Proactive Defense through Automated Attack Generation: A Multi-pronged Study of Generated Deceptive Content

Date

2020-12

Abstract

Social engineering attacks are a serious security threat: in attacks such as phishing and email masquerading, a perpetrator impersonates a legitimate entity to steal an unsuspecting victim's digital identity. Although such attacks have a high probability of success, executing them manually is costly in time and labor. With advances in machine learning and natural language processing, attackers can now use more sophisticated methods to evade detection. Deep neural language models trained on large amounts of written text are capable of generating natural-sounding text. While these techniques have been applied to creative tasks such as story generation, they have also been abused to generate fake content such as fake news. In a proactive scenario, the defender presumes that attackers will resort to sophisticated yet automated methods of attack vector generation. Applying neural text generation to emails, however, is challenging owing to the noise and sparsity of email data and the diversity of email writing styles. Moreover, evaluating and detecting generated content is a challenging and cumbersome task, and current automated metrics are not an ideal substitute for manual evaluation. We analyze automated content generation for two tasks: (a) creative content or story generation from writing prompts; and (b) email generation from given subject prompts for specific intents. We split the analysis for each task into three parts: (i) content (story/email) generation; (ii) fine-tuning and improving the generated content; and (iii) content evaluation. In addition to testing baselines such as word-based recurrent neural networks and pre-trained, fine-tuned transformer language models, we propose HiGen, a hierarchical architecture that improves the output of a generative language model using sentence embeddings, given a prior conditioning prompt. Finally, we compare the linguistic quality of the generated text with human-authored text using a set of automated metrics, and we corroborate our findings with a human user study to ascertain how well the metrics distinguish between writing patterns. We also explore whether system performance differs with the genre of text generation: stories vs. emails. We observe an overall improvement in sentence coherence in content generated by the HiGen architecture.
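
For illustration only, the following is a minimal sketch of the general idea described above: a pre-trained language model produces several candidate continuations for a conditioning prompt, and the candidate whose sentence embedding is most similar to the prompt is kept. This is not the dissertation's HiGen implementation; the choice of GPT-2, the all-MiniLM-L6-v2 sentence encoder, the sampling parameters, and the function names are assumptions.

    # Hypothetical sketch: candidate generation followed by sentence-embedding
    # re-ranking against the conditioning prompt. Model choices are assumptions.
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast
    from sentence_transformers import SentenceTransformer, util

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    lm = GPT2LMHeadModel.from_pretrained("gpt2")
    embedder = SentenceTransformer("all-MiniLM-L6-v2")

    def generate_with_reranking(prompt, num_candidates=5, max_new_tokens=60):
        inputs = tokenizer(prompt, return_tensors="pt")
        outputs = lm.generate(
            **inputs,
            do_sample=True,
            top_p=0.9,
            max_new_tokens=max_new_tokens,
            num_return_sequences=num_candidates,
            pad_token_id=tokenizer.eos_token_id,
        )
        # Decode only the newly generated tokens for each candidate.
        prompt_len = inputs["input_ids"].shape[1]
        candidates = [
            tokenizer.decode(seq[prompt_len:], skip_special_tokens=True)
            for seq in outputs
        ]
        # Keep the candidate whose sentence embedding is closest to the prompt.
        prompt_emb = embedder.encode(prompt, convert_to_tensor=True)
        cand_embs = embedder.encode(candidates, convert_to_tensor=True)
        scores = util.cos_sim(prompt_emb, cand_embs)[0]
        return candidates[int(torch.argmax(scores))]

    print(generate_with_reranking("Subject: Quarterly report submission deadline"))

In practice, one would first fine-tune the base language model on the target genre (stories or emails) before applying such embedding-based re-ranking.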

Keywords

Natural Language Generation, Deep Learning, Transformer Architecture, Deep Neural Network, Email Generation, Story Generation, Language Modeling, Coherence Metrics

Citation

Portions of this document appear in: Das, Avisha, and Rakesh M. Verma. "Can machines tell stories? A comparative study of deep neural language models and metrics." IEEE Access 8 (2020): 181258-181292. And in: Das, Avisha, and Rakesh Verma. "Automated email generation for targeted attacks using natural language." arXiv preprint arXiv:1908.06893 (2019).