Content and Stylistic Models for Authorship, Stance, and Hyperpartisan Detection
This dissertation presents content and stylistic solutions for three opinion-oriented text classification problems. It explores user-generated text to examine how individuals write (authorship identification), how they express opinions (stance detection), and how they report news when aligned with the political left or right (hyperpartisan news detection). For the first problem, this research studies deception detection in online reviews. It compares the distributions of structural text features using KL-divergence to find the most discriminative elements of an individual's writing style. It then proposes a transductive algorithm that learns from unlabeled test data to expand the training set. Next, it addresses authorship verification for document pairs that differ in topic, genre, or both, presenting a neural network model with parallel recurrent layers and a fusion mechanism that compares the language of the two documents. The model is evaluated on datasets from multiple domains, including multi-topic, multi-genre PAN datasets, Amazon reviews, and a dataset of machine learning articles. According to the experimental results, the model achieves stable and competitive performance compared to the baselines. Finally, a hierarchical version of the network with two layers of attention is designed to detect writing style changes within a document. The model uses the structural features of each sentence to observe transitions in writing style. Experimental evaluation on the PAN 2018 dataset confirms the earlier finding that structural elements are effective in representing writing style. For the second problem, this research identifies the stance of argumentative opinions, a novel application of opinion mining. Its proposed data consists of arguments presented in a nonpartisan format.
While it is acknowledged that accurate information from both sides of contemporary issues is an 'antidote to confirmation bias' and that such information helps society improve critical thinking and open-mindedness, it is relatively rare and hard to find online. Given the well-researched, unbiased arguments on controversial issues shared by Procon.org, detecting the stance of arguments is a crucial step toward automating the organization of such resources. To address this, the research employs a universal pretrained language model with a weight-dropped LSTM neural network to leverage the context of an argument for finding its stance. The analysis shows the strength of pretraining and the ability of the model to find the stance of long arguments across entire documents using pooling operations. Finally, this dissertation provides an approach to test whether latent personality features in individuals' writing can be useful in the three opinion-oriented classification tasks. The approach deploys a state-of-the-art deep bidirectional transformer to extract Myers-Briggs personality types from user posts. The posts are collected from Reddit, Twitter, and a personality forum, with personality types self-reported by the users. It then induces personality information from the proposed transformer-based model and combines it with other classification models. Experimental evidence shows the effectiveness of personality information in authorship verification, stance detection of arguments, and hyperpartisan news detection after topic-based sub-sampling of the news training data.
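The KL-divergence comparison of structural feature distributions described for the first problem can be illustrated with a minimal sketch. It is not the dissertation's implementation: the feature vocabulary (punctuation marks), the add-alpha smoothing, and the function names `feature_distribution`, `kl_contributions`, `kl_divergence`, and `most_discriminative` are all assumptions made for illustration.

```python
import math
from collections import Counter


def feature_distribution(tokens, vocabulary, alpha=1.0):
    """Smoothed relative frequency of each structural feature.

    Add-alpha smoothing (an assumption here) keeps every probability
    positive so that the log ratio below is always defined.
    """
    counts = Counter(t for t in tokens if t in vocabulary)
    total = sum(counts.values()) + alpha * len(vocabulary)
    return {f: (counts[f] + alpha) / total for f in vocabulary}


def kl_contributions(p, q):
    """Per-feature terms p(f) * log(p(f) / q(f)) of KL(p || q)."""
    return {f: p[f] * math.log(p[f] / q[f]) for f in p}


def kl_divergence(p, q):
    """KL(p || q) as the sum of the per-feature terms."""
    return sum(kl_contributions(p, q).values())


def most_discriminative(p, q, k=3):
    """Features ranked by their contribution to KL(p || q)."""
    contrib = kl_contributions(p, q)
    return sorted(contrib, key=contrib.get, reverse=True)[:k]


# Hypothetical toy data: punctuation used by two authors.
vocabulary = [",", ";", "!", "."]
author_a = [";"] * 6 + [","] * 2 + ["."] * 2
author_b = ["!"] * 5 + [","] * 3 + ["."] * 2

p = feature_distribution(author_a, vocabulary)
q = feature_distribution(author_b, vocabulary)
print(kl_divergence(p, q))
print(most_discriminative(p, q, k=1))
```

In this toy setting, the feature with the largest contribution is the one author A overuses relative to author B, which is the sense in which a divergence-based ranking can surface discriminative elements of writing style.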