Linguistic Analysis, Data Mining, and Clustering to Predict Document Age

dc.contributor: Shah, Shishir Kirit
dc.contributor.author: Freeman, Keegan
dc.date.accessioned: 2021-02-23T16:22:25Z
dc.date.available: 2021-02-23T16:22:25Z
dc.date.issued: 2019
dc.description.abstract: The goal of this research is to interpret enough linguistic data to provide a basis on which clustering and time prediction can be performed. With knowledge of how linguistic structure changes over time, conclusions can be drawn about how communication has changed and will continue to change. The data collected from the 250 raw text sources are as follows: word stem tally, word stem percentage, part of speech tally, part of speech percentage, scored part of speech collocations through both bigrams and trigrams, and average sentence length. Suitable metrics were chosen from this linguistic data, totaling 66 dimensions for each source “point.” These 66 dimensions were then filtered to determine which metrics are unique and therefore proper indicators of a source’s time period. The filtering was performed by an algorithm that determined which metrics, over the given time period, have slopes within a 1.15 ratio of one another for 90% of the time. Metrics with similar slopes are redundant, so their dimensions can be removed from the source points, as can those of metrics whose slopes are 0. Using the elbow method and clustering, each source point is assigned to a cluster, and the clusters should represent various time spans between 1900 and 2000. With this data, a confusion matrix can be displayed to indicate the success in correctly identifying a source’s individual time span. Due to the limited corpora used, prediction with precision beyond that of an indiscriminate projection has not yet been achieved.
dc.description.department: Electrical and Computer Engineering, Department of
dc.description.department: Honors College
dc.identifier.uri: https://hdl.handle.net/10657/7527
dc.language.iso: en_US
dc.relation.ispartof: Summer Undergraduate Research Fellowship
dc.rights: The author of this work is the copyright owner. UH Libraries and the Texas Digital Library have their permission to store and provide access to this work. Further transmission, reproduction, or presentation of this work is prohibited except with permission of the author(s).
dc.title: Linguistic Analysis, Data Mining, and Clustering to Predict Document Age
dc.type: Poster
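
The slope-based filtering described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the author's implementation: the abstract does not say how slopes are measured or what happens to a redundant pair, so the sketch assumes slopes are taken between consecutive time points, that two metrics count as similar when their slopes are within a factor of 1.15 of each other in at least 90% of intervals, and that the first metric of each similar group is kept. The function names, metric names, and sample values are all hypothetical.

```python
def interval_slopes(times, values):
    """Slope of one metric between each pair of consecutive time points."""
    points = list(zip(times, values))
    return [(v1 - v0) / (t1 - t0)
            for (t0, v0), (t1, v1) in zip(points, points[1:])]


def slopes_similar(a, b, ratio=1.15, share=0.90):
    """True if slopes a and b are within `ratio` of each other
    in at least `share` of the intervals (assumed interpretation)."""
    def close(x, y):
        if x == 0 or y == 0:
            return x == y              # a zero slope only matches another zero
        if (x > 0) != (y > 0):
            return False               # opposite directions are never similar
        r = abs(x) / abs(y)
        return 1 / ratio <= r <= ratio
    hits = sum(close(x, y) for x, y in zip(a, b))
    return hits >= share * len(a)


def filter_metrics(times, metrics, ratio=1.15, share=0.90):
    """Names of metrics kept after dropping flat and redundant ones."""
    slopes = {name: interval_slopes(times, vals)
              for name, vals in metrics.items()}
    kept = []
    for name, s in slopes.items():
        if all(x == 0 for x in s):
            continue                   # slope 0 everywhere: no time signal
        if any(slopes_similar(s, slopes[k], ratio, share) for k in kept):
            continue                   # redundant with a metric already kept
        kept.append(name)
    return kept


# Hypothetical sample data: four metrics measured at five time points.
times = [1900, 1925, 1950, 1975, 2000]
metrics = {
    "noun_pct": [20.0, 22.0, 24.0, 26.0, 28.0],
    "verb_pct": [10.0, 12.1, 14.2, 16.3, 18.4],          # tracks noun_pct
    "avg_sentence_len": [25.0, 23.0, 22.0, 20.0, 19.0],  # falls over time
    "flat_metric": [5.0, 5.0, 5.0, 5.0, 5.0],            # slope 0
}
kept = filter_metrics(times, metrics)
print(kept)
```

With the sample data, `verb_pct` is dropped as redundant with `noun_pct`, and `flat_metric` is dropped for having zero slope, so only the remaining dimensions would be passed on to clustering.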

Files

Original bundle

Name: Freeman_Keegan_URD2019.pdf
Size: 1.9 MB
Format: Adobe Portable Document Format