Textractor: A New Approach for N-Gram, Collocation and Multi-Word Expression Extraction

Vuppuluri, Vasanthi 1991-

dc.contributor.advisor	Verma, Rakesh M.
dc.contributor.committeeMember	Cheng, Kam-Hoi
dc.contributor.committeeMember	Ramchand, Latha
dc.creator	Vuppuluri, Vasanthi 1991-
dc.date.accessioned	2020-01-03T23:19:19Z
dc.date.created	May 2015
dc.date.issued	2015-05
dc.date.submitted	May 2015
dc.date.updated	2020-01-03T23:19:19Z
dc.description.abstract	The vast amount of knowledge now available on the Internet represents a great opportunity for automatic, intelligent text processing and understanding, but there are major problems in finding legitimate sources of information and in overcoming rate limitations on search engine APIs. The work in this thesis describes methods that combine the knowledge of the World Wide Web (WWW) and the power of Internet search with the knowledge extracted from dictionaries. This thesis presents Textractor, an unsupervised, domain-independent, general-purpose n-gram, collocation and multi-word expression (MWE) extraction software written in Python. It is modular and allows the user to choose from and compare different methods for identifying n-grams, collocations and MWEs, including statistical, dictionary-based and Internet-based methods. This thesis shows that it is very hard to identify collocations based on statistical information from the given text document alone (although this might seem obvious, some systems do use it), and that dictionary and Internet-based techniques, when combined properly, can be very effective sources of collocations and MWEs without their respective drawbacks. This method can overcome the limitations of current Natural Language Processing techniques. For example, Textractor can recognize collocations and MWEs even when the complete sentence is not present, and when the domain of the data is not known. It is currently designed to work with text in English but can easily be extended to other languages.
dc.description.department	Computer Science, Department of
dc.format.digitalOrigin	born digital
dc.format.mimetype	application/pdf
dc.identifier.uri	https://hdl.handle.net/10657/5693
dc.language.iso	eng
dc.rights	The author of this work is the copyright owner. UH Libraries and the Texas Digital Library have their permission to store and provide access to this work. Further transmission, reproduction, or presentation of this work is prohibited except with permission of the author(s).
dc.subject	N-grams
dc.subject	Collocations
dc.subject	MWE
dc.title	Textractor: A New Approach for N-Gram, Collocation and Multi-Word Expression Extraction
dc.type.dcmi	Text
dc.type.genre	Thesis
local.embargo.lift	2020-05-01
local.embargo.terms	2020-05-01
thesis.degree.college	College of Natural Sciences and Mathematics
thesis.degree.department	Computer Science, Department of
thesis.degree.discipline	Computer Science
thesis.degree.grantor	University of Houston
thesis.degree.level	Masters
thesis.degree.name	Master of Science

Files

Original bundle

Name: VUPPULURI-THESIS-2015.pdf
Size: 308.88 KB
Format: Adobe Portable Document Format

Collections

Published ETD Collection