Textractor: A New Approach for N-Gram, Collocation and Multi-Word Expression Extraction

dc.contributor.advisor: Verma, Rakesh M.
dc.contributor.committeeMember: Cheng, Kam-Hoi
dc.contributor.committeeMember: Ramchand, Latha
dc.creator: Vuppuluri, Vasanthi 1991-
dc.date.accessioned: 2020-01-03T23:19:19Z
dc.date.created: May 2015
dc.date.issued: 2015-05
dc.date.submitted: May 2015
dc.date.updated: 2020-01-03T23:19:19Z
dc.description.abstract: A vast amount of knowledge is now available on the Internet, which represents a great opportunity for automatic, intelligent text processing and understanding, but there are major problems in finding legitimate sources of information and in overcoming rate limits on search engine APIs. This thesis describes methods that combine the knowledge of the World Wide Web (WWW) and the power of Internet search with knowledge extracted from dictionaries. It presents Textractor, unsupervised, domain-independent, general-purpose software for n-gram, collocation, and multi-word expression (MWE) extraction, written in Python. Textractor is modular and allows the user to choose from and compare different methods for identifying n-grams, collocations, and MWEs, including statistical, dictionary-based, and Internet-based methods. The thesis shows that it is very hard to identify collocations based on statistical information from the given text document alone (although this might seem obvious, some systems do rely on it), and that dictionary and Internet-based techniques, when combined properly, can be very effective sources of collocations and MWEs without their respective drawbacks. This combined method can overcome limitations of current Natural Language Processing techniques. For example, Textractor can recognize collocations and MWEs even when the complete sentence is not present and when the domain of the data is not known. It is currently designed to work with English text but can easily be extended to other languages.
dc.description.department: Computer Science, Department of
dc.format.digitalOrigin: born digital
dc.format.mimetype: application/pdf
dc.identifier.uri: https://hdl.handle.net/10657/5693
dc.language.iso: eng
dc.rights: The author of this work is the copyright owner. UH Libraries and the Texas Digital Library have their permission to store and provide access to this work. Further transmission, reproduction, or presentation of this work is prohibited except with permission of the author(s).
dc.subject: N-grams
dc.subject: Collocations
dc.subject: MWE
dc.title: Textractor: A New Approach for N-Gram, Collocation and Multi-Word Expression Extraction
dc.type.dcmi: Text
dc.type.genre: Thesis
local.embargo.lift: 2020-05-01
local.embargo.terms: 2020-05-01
thesis.degree.college: College of Natural Sciences and Mathematics
thesis.degree.department: Computer Science, Department of
thesis.degree.discipline: Computer Science
thesis.degree.grantor: University of Houston
thesis.degree.level: Masters
thesis.degree.name: Master of Science
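
The abstract mentions statistical methods that use only the given text document. As a rough illustration of that kind of baseline, the sketch below extracts n-grams and ranks bigrams by pointwise mutual information (PMI). It is not Textractor's code: the abstract does not say which statistical measures the software uses, so PMI, the tokenizer, the function names, and the min_count threshold are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def ngrams(tokens, n):
    """Return all contiguous n-grams as tuples, in document order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def pmi_bigrams(tokens, min_count=2):
    """Rank bigrams by pointwise mutual information (an assumed measure).

    PMI(x, y) = log2(P(x, y) / (P(x) * P(y))), with every probability
    estimated from counts in this one document only.
    """
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(ngrams(tokens, 2))
    n_tokens = len(tokens)
    n_bigrams = max(n_tokens - 1, 1)
    scores = {}
    for (x, y), count in bigram_counts.items():
        # Skip rare bigrams: PMI is unreliable for very low counts.
        if count < min_count:
            continue
        p_xy = count / n_bigrams
        p_x = unigram_counts[x] / n_tokens
        p_y = unigram_counts[y] / n_tokens
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    # Highest-scoring candidates first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Calling pmi_bigrams(tokenize(text)) on a single short document typically returns few candidates with noisy scores, which is consistent with the abstract's point that single-document statistics alone are a weak signal for collocations.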

Files

Original bundle

Name: VUPPULURI-THESIS-2015.pdf
Size: 308.88 KB
Format: Adobe Portable Document Format
