Textractor: A New Approach for N-Gram, Collocation and Multi-Word Expression Extraction

Date

2015-05

Abstract

The vast amount of knowledge now available on the Internet presents a great opportunity for automatic, intelligent text processing and understanding, but it also raises two major problems: finding legitimate sources of information and overcoming rate limits on search engine APIs. This thesis describes methods that combine the knowledge of the World Wide Web (WWW) and the power of Internet search with knowledge extracted from dictionaries. It presents Textractor, an unsupervised, domain-independent, general-purpose tool for extracting n-grams, collocations and multi-word expressions (MWEs), written in Python. Textractor is modular and lets the user choose from and compare different methods for identifying n-grams, collocations and MWEs, including statistical, dictionary-based and Internet-based methods. The thesis shows that it is very hard to identify collocations from the statistical information in a given text document alone (this may seem obvious, but some systems do rely on it), and that dictionary-based and Internet-based techniques, when combined properly, can be very effective sources of collocations and MWEs while avoiding their respective drawbacks. The combined method can overcome limitations of current natural language processing techniques: for example, Textractor can recognize collocations and MWEs even when the complete sentence is not present and when the domain of the data is not known. It is currently designed to work with English text but can easily be extended to other languages.
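
To make the purely statistical baseline mentioned in the abstract concrete, the following is a minimal sketch of n-gram extraction and bigram scoring by pointwise mutual information (PMI) over a single document. It is not Textractor's code: the function names (`tokenize`, `ngrams`, `pmi_scores`), the regular-expression tokenizer, the choice of PMI as the association measure, and the `min_count` threshold are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens (illustrative tokenizer)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def pmi_scores(tokens, min_count=2):
    """Score each bigram seen at least min_count times by PMI (log base 2).

    PMI compares how often two words occur together with how often they
    would co-occur if they were independent. The min_count threshold is
    an assumption of this sketch, used to skip one-off pairs.
    """
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(ngrams(tokens, 2))
    total_unigrams = len(tokens)
    total_bigrams = max(len(tokens) - 1, 1)

    scores = {}
    for (w1, w2), count in bigram_counts.items():
        if count < min_count:
            continue
        p_pair = count / total_bigrams
        p_w1 = unigram_counts[w1] / total_unigrams
        p_w2 = unigram_counts[w2] / total_unigrams
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


if __name__ == "__main__":
    sample = "New York is big. New York is old. The city is big."
    ranked = sorted(pmi_scores(tokenize(sample)).items(),
                    key=lambda item: item[1], reverse=True)
    for bigram, score in ranked:
        print(" ".join(bigram), round(score, 2))
```

On a single short document the counts behind these scores are very sparse, so most word pairs occur once and the ranking is unstable. This is consistent with the abstract's observation that document-internal statistics alone are a weak source of collocations.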

Keywords

N-grams, Collocations, Multi-Word Expressions (MWE)
