Neural Sequence Labeling on Social Media Text

Aguilar, Gustavo

dc.contributor.advisor	Solorio, Thamar
dc.contributor.committeeMember	Diab, Mona
dc.contributor.committeeMember	Kakadiaris, Ioannis A.
dc.contributor.committeeMember	Verma, Rakesh M.
dc.creator	Aguilar, Gustavo
dc.creator.orcid	0000-0002-3028-7626
dc.date.accessioned	2021-04-08T16:27:48Z
dc.date.available	2021-04-08T16:27:48Z
dc.date.created	December 2020
dc.date.issued	2020-12
dc.date.submitted	December 2020
dc.date.updated	2021-04-08T16:27:53Z
dc.description.abstract	As social media (SM) brings opportunities to study societies across the world, it also brings a variety of challenges to automate the processing of SM language. In particular, most of the textual content in SM is considered noisy; it does not always stick to the rules of the written language, and it tends to have misspellings, arbitrary abbreviations, orthographic inconsistencies, and flexible grammar. Additionally, SM platforms provide a unique space for multilingual content. This polyglot environment requires modern systems to adapt to a diverse range of languages, imposing another linguistic barrier to processing and understanding of text from SM domains. This thesis aims at providing novel sequence labeling approaches to handle noise and linguistic code-switching (i.e., the alternation of languages in the same utterance) in SM text. In particular, the first part of this thesis focuses on named entity recognition for English SM text, where I propose linguistically-inspired methods to address phonological writing and flexible syntax. Besides, I investigate whether the performance of current state-of-the-art models relies on memorization or contextual generalization of entities. In the second part of this thesis, I focus on three sequence labeling tasks for code-switched SM text: language identification, part-of-speech tagging, and named entity recognition. Specifically, I propose transfer learning methods from state-of-the-art monolingual and multilingual models, such as ELMo and BERT, to the code-switching setting for sequence labeling. These methods reduce the demand for code-switching annotations and resources while exploiting multilingual knowledge from large pre-trained unsupervised models. The methods presented in this thesis are meant to benefit higher-level NLP applications oriented to social media domains, including but not limited to question-answering, conversational systems, and information extraction.
dc.description.department	Computer Science, Department of
dc.format.digitalOrigin	born digital
dc.format.mimetype	application/pdf
dc.identifier.citation	Portions of this document appear in: Patwa, Parth, Gustavo Aguilar, Sudipta Kar, Suraj Pandey, Srinivas PYKL, Björn Gambäck, Tanmoy Chakraborty, Thamar Solorio, and Amitava Das. "SemEval-2020 task 9: Overview of sentiment analysis of code-mixed tweets." arXiv e-prints (2020): arXiv-2008.; Aguilar, Gustavo, Sudipta Kar, and Thamar Solorio. "LinCE: A centralized benchmark for linguistic code-switching evaluation." arXiv preprint arXiv:2005.04322 (2020).; Aguilar, Gustavo, Yuan Ling, Yu Zhang, Benjamin Yao, Xing Fan, and Chenlei Guo. 2020. "Knowledge Distillation from Internal Representations." Proceedings of the AAAI Conference on Artificial Intelligence 34 (05):7350-57. https://doi.org/10.1609/aaai.v34i05.6229.; Aguilar, Gustavo, A. Pastor López-Monroy, Fabio A. González, and Thamar Solorio. "Modeling noisiness to recognize named entities using multitask neural networks on social media." arXiv preprint arXiv:1906.04129 (2019).; Aguilar, Gustavo, Suraj Maharjan, Adrian Pastor López-Monroy, and Thamar Solorio. "A multi-task approach for named entity recognition in social media data." arXiv preprint arXiv:1906.04135 (2019).; Aguilar, Gustavo, Viktor Rozgić, Weiran Wang, and Chao Wang. "Multimodal and multi-view models for emotion recognition." arXiv preprint arXiv:1906.10198 (2019).; Aguilar, Gustavo, and Thamar Solorio. "From English to Code-Switching: Transfer Learning with Strong Morphological Clues." arXiv preprint arXiv:1909.05158 (2019).
dc.identifier.uri	https://hdl.handle.net/10657/7726
dc.language.iso	eng
dc.rights	The author of this work is the copyright owner. UH Libraries and the Texas Digital Library have their permission to store and provide access to this work. UH Libraries has secured permission to reproduce any and all previously published materials contained in the work. Further transmission, reproduction, or presentation of this work is prohibited except with permission of the author(s).
dc.subject	sequence labeling
dc.subject	social media
dc.subject	neural networks
dc.subject	noisy text
dc.title	Neural Sequence Labeling on Social Media Text
dc.type.dcmi	Text
dc.type.genre	Thesis
thesis.degree.college	College of Natural Sciences and Mathematics
thesis.degree.department	Computer Science, Department of
thesis.degree.discipline	Computer Science
thesis.degree.grantor	University of Houston
thesis.degree.level	Doctoral
thesis.degree.name	Doctor of Philosophy

Files

Original bundle

Name: AGUILAR-DISSERTATION-2020.pdf
Size: 15.61 MB
Format: Adobe Portable Document Format

License bundle

Name: PROQUEST_LICENSE.txt
Size: 4.43 KB
Format: Plain Text

Name: LICENSE.txt
Size: 1.82 KB
Format: Plain Text

Collections

Published ETD Collection