Linguistic Analysis, Data Mining, and Clustering to Predict Document Age

Freeman, Keegan

dc.contributor	Shah, Shishir Kirit
dc.contributor.author	Freeman, Keegan
dc.date.accessioned	2021-02-23T16:22:25Z
dc.date.available	2021-02-23T16:22:25Z
dc.date.issued	2019
dc.description.abstract	The goal of this research is to interpret enough linguistic data to provide a basis on which clustering and time prediction can be performed. With knowledge of how linguistic structure changes over time, conclusions can be drawn about how communication has changed and will continue to change. The data collected from the 250 raw text sources are as follows: word stem tally, word stem percentage, part of speech tally, part of speech percentage, scored part of speech collocations through both bigrams and trigrams, and average sentence length. Suitable metrics were chosen from this linguistic data, totaling 66 dimensions for each source “point.” These 66 dimensions were then filtered to determine which metrics are unique and therefore proper indicators of a source’s time period. The filtering algorithm identified which metrics have slopes within a ratio of 1.15 of one another for 90% of the given time period. Metrics with similar slopes are redundant, so their dimensions can be removed from the source points, as can those whose slopes are 0. Using the elbow method and clustering, each source point is assigned to a cluster, and the clusters should represent various time spans between 1900 and 2000. With this data, a confusion matrix can be displayed to indicate the success in correctly identifying a source’s individual time span. Because of the limited corpora used, prediction accuracy has not yet exceeded that of an indiscriminate projection.
dc.description.department	Electrical and Computer Engineering, Department of
dc.description.department	Honors College
dc.identifier.uri	https://hdl.handle.net/10657/7527
dc.language.iso	en_US
dc.relation.ispartof	Summer Undergraduate Research Fellowship
dc.rights	The author of this work is the copyright owner. UH Libraries and the Texas Digital Library have their permission to store and provide access to this work. Further transmission, reproduction, or presentation of this work is prohibited except with permission of the author(s).
dc.title	Linguistic Analysis, Data Mining, and Clustering to Predict Document Age
dc.type	Poster
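
The slope-based filtering described in the abstract can be illustrated with a minimal sketch. This is not the author's implementation. It assumes each metric is summarized as one value per time point, that "slope" means the slope between consecutive time points, and that two metrics count as redundant when their slopes share a sign and differ by a factor of at most 1.15 on at least 90% of those segments. The function names, metric names, and sample values are hypothetical.

```python
import numpy as np

def segment_slopes(times, values):
    """Slope between each pair of consecutive time points."""
    return np.diff(values) / np.diff(times)

def slopes_similar(a, b, max_ratio=1.15, min_fraction=0.90):
    """True if slopes a and b agree within max_ratio on at least
    min_fraction of the segments (assumed reading of the abstract)."""
    same_sign = np.sign(a) == np.sign(b)
    big = np.maximum(np.abs(a), np.abs(b))
    small = np.minimum(np.abs(a), np.abs(b))
    # Two zero slopes agree; a zero slope never matches a nonzero one.
    within = np.where(small > 0, big <= max_ratio * small, big == 0)
    return np.mean(same_sign & within) >= min_fraction

def filter_metrics(times, series):
    """Return the metric names kept after dropping flat and redundant ones."""
    kept = []  # list of (name, slopes) for metrics retained so far
    for name, values in series.items():
        s = segment_slopes(times, np.asarray(values, dtype=float))
        if np.allclose(s, 0.0):
            continue  # slope of 0 everywhere: carries no time signal
        if any(slopes_similar(s, ks) for _, ks in kept):
            continue  # redundant with a metric that was already kept
        kept.append((name, s))
    return [name for name, _ in kept]

# Hypothetical per-period values for four metrics (illustration only).
times = np.array([1900, 1925, 1950, 1975, 2000], dtype=float)
series = {
    "noun_pct": [10, 12, 14, 16, 18],
    "verb_pct": [5, 7.1, 9.2, 11.3, 13.4],    # slopes about 1.05x noun_pct
    "avg_sentence_len": [20, 20, 20, 20, 20],  # flat
    "adj_pct": [8, 7, 6, 5, 4],                # opposite trend
}
kept = filter_metrics(times, series)
print(kept)
```

In this sketch the first metric of a redundant group is the one retained; the abstract does not say which member of a group is kept, so that choice is also an assumption.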

Files

Original bundle

Name: Freeman_Keegan_URD2019.pdf
Size: 1.9 MB
Format: Adobe Portable Document Format

Collections

Undergraduate Research Day Student Projects