    Neural Sequence Labeling on Social Media Text

    View/Open
    AGUILAR-DISSERTATION-2020.pdf (15.61 MB)
    Date
    2020-12
    Author
    Aguilar, Gustavo
    ORCID: 0000-0002-3028-7626
    Abstract
    As social media (SM) brings opportunities to study societies across the world, it also brings a variety of challenges to automating the processing of SM language. In particular, most of the textual content in SM is considered noisy; it does not always adhere to the rules of written language, and it tends to contain misspellings, arbitrary abbreviations, orthographic inconsistencies, and flexible grammar. Additionally, SM platforms provide a unique space for multilingual content. This polyglot environment requires modern systems to adapt to a diverse range of languages, imposing another linguistic barrier to the processing and understanding of text from SM domains. This thesis aims to provide novel sequence labeling approaches to handle noise and linguistic code-switching (i.e., the alternation of languages in the same utterance) in SM text. In particular, the first part of this thesis focuses on named entity recognition for English SM text, where I propose linguistically inspired methods to address phonological writing and flexible syntax. In addition, I investigate whether the performance of current state-of-the-art models relies on memorization or on contextual generalization of entities. In the second part of this thesis, I focus on three sequence labeling tasks for code-switched SM text: language identification, part-of-speech tagging, and named entity recognition. Specifically, I propose transfer learning methods from state-of-the-art monolingual and multilingual models, such as ELMo and BERT, to the code-switching setting for sequence labeling. These methods reduce the demand for code-switching annotations and resources while exploiting the multilingual knowledge of large pre-trained unsupervised models. The methods presented in this thesis are meant to benefit higher-level NLP applications oriented to social media domains, including but not limited to question answering, conversational systems, and information extraction.
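    To make the transfer-learning idea in the abstract concrete, below is a minimal sketch of fine-tuning a pre-trained multilingual encoder as a token classifier for word-level language identification on code-switched text. It is not the author's exact setup: the choice of bert-base-multilingual-cased, the three-way label set, and the toy Spanish-English sentence are illustrative assumptions, and a real experiment would train over a labeled code-switching corpus rather than a single example.

# Sketch: transfer learning from a pre-trained multilingual model to a
# code-switched sequence labeling task (word-level language identification).
# Model choice, labels, and data are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["lang1", "lang2", "other"]            # hypothetical LID tag set
label2id = {l: i for i, l in enumerate(labels)}

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased",
    num_labels=len(labels),
    id2label={i: l for l, i in label2id.items()},
)

# One toy Spanish-English code-switched sentence with word-level tags.
words = ["I", "need", "que", "me", "ayudes", "now"]
tags  = ["lang1", "lang1", "lang2", "lang2", "lang2", "lang1"]

# Tokenize pre-split words and align word-level labels to subword pieces;
# special tokens and continuation pieces get -100 so the loss ignores them.
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
aligned, prev = [], None
for wid in enc.word_ids(batch_index=0):
    aligned.append(-100 if wid is None or wid == prev else label2id[tags[wid]])
    prev = wid
enc["labels"] = torch.tensor([aligned])

# A single fine-tuning step; a real run would loop over batches and epochs.
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
loss = model(**enc).loss
loss.backward()
optimizer.step()
print(f"loss: {loss.item():.4f}")

    The same pattern carries over to the thesis's other code-switched tasks (part-of-speech tagging and named entity recognition) by swapping in the corresponding tag set and training data; the pre-trained multilingual encoder supplies the cross-lingual knowledge that reduces the need for code-switching-specific annotations.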
    URI
    https://hdl.handle.net/10657/7726
    Collections
    • Published ETD Collection
