Authorship Attribution for Realistic Scenarios

dc.contributor.advisor: Solorio, Thamar
dc.contributor.committeeMember: Gonzalez, Fabio A.
dc.contributor.committeeMember: Rosso, Paolo
dc.contributor.committeeMember: Eick, Christoph F.
dc.contributor.committeeMember: Verma, Rakesh M.
dc.creator: Shrestha, Prasha 1987-
dc.date.accessioned: 2018-11-30T17:14:17Z
dc.date.available: 2018-11-30T17:14:17Z
dc.date.created: May 2018
dc.date.issued: 2018-05
dc.date.submitted: May 2018
dc.date.updated: 2018-11-30T17:14:17Z
dc.description.abstract: Most previous work on authorship attribution makes several simplifying assumptions when framing the problem. It assumes that the set of candidate authors is small and that documents of substantial length are available for each author. It also considers only a single-genre scenario, in which the texts of known authorship share the topic and genre of the text to be attributed. In today's world, where most communication happens online, texts are likely to be short, and the anonymity that social media offers makes it hard to narrow down the candidate authors. Moreover, for domains such as email, in-domain data may not be obtainable, so we need to be able to use data from more readily available sources such as tweets and reviews. We devise a more practical, albeit more challenging, problem that is closely aligned with real-world authorship attribution: short documents, a long list of possible authors, and the ability to leverage datasets from any available domain. In this work, we build neural-network-based models that create a well-rounded representation of the input text. A good representation must capture even the smallest signals in the text that point toward the author. Only a model that can accomplish this can work for short texts while also being fairly robust to an increasing number of authors. Our results show that we were indeed successful in building such models. Our cross-domain representations are capable of filtering out the topic-specific attributes of the text so that what remains is attributable purely to the author's style. This ensures that attribution performance does not degrade when we move from in-domain to cross-domain data. It is essential for authorship attribution methods to work in realistic scenarios, even though this adds complexity to the task. We find that it is indeed possible to create methods that perform well even in these challenging situations.
dc.description.department: Computer Science, Department of
dc.format.digitalOrigin: born digital
dc.format.mimetype: application/pdf
dc.identifier.uri: http://hdl.handle.net/10657/3472
dc.language.iso: eng
dc.rights: The author of this work is the copyright owner. UH Libraries and the Texas Digital Library have their permission to store and provide access to this work. Further transmission, reproduction, or presentation of this work is prohibited except with permission of the author(s).
dc.subject: Authorship attribution
dc.subject: Domain adaptation
dc.subject: Deep learning
dc.subject: Representation learning
dc.subject: Embeddings
dc.subject: CNN for NLP
dc.title: Authorship Attribution for Realistic Scenarios
dc.type.dcmi: Text
dc.type.genre: Thesis
local.embargo.lift: 2020-05-01
local.embargo.terms: 2020-05-01
thesis.degree.college: College of Natural Sciences and Mathematics
thesis.degree.department: Computer Science, Department of
thesis.degree.discipline: Computer Science
thesis.degree.grantor: University of Houston
thesis.degree.level: Doctoral
thesis.degree.name: Doctor of Philosophy

Files

Original bundle

Name: SHRESTHA-DISSERTATION-2018.pdf
Size: 741.87 KB
Format: Adobe Portable Document Format

Name: prasha-thesis.zip
Size: 675.46 KB
Format: Unknown data format

License bundle

Name: PROQUEST_LICENSE.txt
Size: 4.43 KB
Format: Plain Text

Name: LICENSE.txt
Size: 1.82 KB
Format: Plain Text