Authorship Attribution for Realistic Scenarios






Most previous work on authorship attribution makes several simplifying assumptions when formulating the problem. It assumes that the set of candidate authors is small and that documents of substantial length are available for each author. It also considers only a single-genre scenario, in which the texts of known authorship share the topic and genre of the text to be attributed. In today's world, where most communication happens online, texts are likely to be short, and the anonymity that social media offers makes it hard to narrow down the candidate authors. Moreover, for domains such as email, in-domain data may be unavailable, so we need to be able to use data from more readily available sources such as tweets and reviews. We formulate a more practical, albeit more challenging, problem that is closely aligned with real-world authorship attribution: short documents, a large set of candidate authors, and the ability to leverage datasets from any available domain. In this work, we build neural network based models that create a well-rounded representation of the input text. A good representation must capture even the faintest signals in the text that point towards the author; only a model that accomplishes this can work for short texts while remaining fairly robust to an increasing number of authors. Our results show that our models meet these requirements. Our cross-domain representations filter out the topic-specific attributes of the text, so that what remains reflects only the author's style. This ensures that attribution performance does not degrade when we move from in-domain to cross-domain data. It is essential for authorship attribution methods to work in realistic scenarios, even though this adds complexity to the task. We find that it is indeed possible to create methods that perform well even in these challenging situations.



Authorship attribution, Domain adaptation, Deep learning, Representation learning, Embeddings, CNN for NLP