Linguistic Analysis, Data Mining, and Clustering to Predict Document Age






The goal of this research is to interpret enough linguistic data to provide a basis on which clustering and time prediction can be performed. With knowledge of how linguistic structure changes over time, conclusions can be drawn about how communication has changed and will continue to change. The data collected from the 250 raw text sources are as follows: word stem tally, word stem percentage, part-of-speech tally, part-of-speech percentage, scored part-of-speech collocations (both bigrams and trigrams), and average sentence length. Suitable metrics were chosen from this linguistic data, totaling 66 dimensions for each source “point.” These 66 dimensions were then filtered to determine which metrics are unique and are therefore proper indicators of a source’s time period. The filtering was performed using an algorithm that determined which metrics, over the given time period, have slopes within a 1.15 ratio of one another 90% of the time. Metrics with similar slopes are redundant, so their dimensions can be removed from the source points, as can those whose slopes are 0. Using the elbow method and clustering, each source point is assigned to a cluster, and the clusters should represent various time spans between 1900 and 2000. With this data, a confusion matrix can be displayed to indicate how successfully each source’s time span was identified. Due to the limited corpora used, prediction more precise than an indiscriminate projection has not yet been achieved.
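
The slope-based filtering step can be illustrated with a minimal sketch. It is one possible reading of the description, not the implementation used in this research. The assumptions are: each metric is summarized as one value per time checkpoint; a slope is taken between consecutive checkpoints; two metrics are "similar" on an interval when their slopes share a sign and their magnitudes differ by a factor of at most 1.15; and one representative of each redundant group is kept. The function names, the metric names, and the sample values are hypothetical.

```python
from typing import Dict, List, Sequence


def interval_slopes(times: Sequence[float], values: Sequence[float]) -> List[float]:
    """Slope of a metric between each pair of consecutive checkpoints."""
    return [
        (values[i + 1] - values[i]) / (times[i + 1] - times[i])
        for i in range(len(times) - 1)
    ]


def slopes_similar(a: float, b: float, max_ratio: float = 1.15) -> bool:
    """True when two slopes agree in sign and are within max_ratio of each other."""
    if a == 0.0 or b == 0.0:
        return a == b  # a zero slope only matches another zero slope
    if (a > 0) != (b > 0):
        return False  # assumption: opposite trends are never treated as similar
    ratio = abs(a) / abs(b)
    return 1.0 / max_ratio <= ratio <= max_ratio


def filter_metrics(
    times: Sequence[float],
    series: Dict[str, Sequence[float]],
    max_ratio: float = 1.15,
    min_agreement: float = 0.90,
) -> List[str]:
    """Return the metric names that survive the flat-slope and redundancy filters."""
    slopes = {name: interval_slopes(times, vals) for name, vals in series.items()}
    kept: List[str] = []
    for name, s in slopes.items():
        if all(x == 0.0 for x in s):
            continue  # a metric that never changes carries no time signal
        redundant = False
        for other in kept:
            agree = sum(
                slopes_similar(x, y, max_ratio) for x, y in zip(s, slopes[other])
            )
            if agree / len(s) >= min_agreement:
                redundant = True  # similar slope on at least 90% of intervals
                break
        if not redundant:
            kept.append(name)  # assumption: keep the first metric of each group
    return kept


# Hypothetical per-checkpoint averages; these are not data from the study.
times = [1900, 1925, 1950, 1975, 2000]
series = {
    "noun_pct": [20.0, 22.0, 24.0, 26.0, 28.0],
    "noun_tally_scaled": [40.0, 42.1, 44.2, 46.3, 48.4],
    "flat_metric": [5.0, 5.0, 5.0, 5.0, 5.0],
    "avg_sentence_len": [30.0, 27.0, 25.0, 22.0, 20.0],
}
surviving = filter_metrics(times, series)
print(surviving)
```

In this toy input, `noun_tally_scaled` rises at nearly the same rate as `noun_pct` on every interval and `flat_metric` never changes, so only `noun_pct` and `avg_sentence_len` remain as candidate dimensions for clustering. Whether a redundant group should keep one member or lose all of them is a design choice the description leaves open; the sketch keeps the first member encountered.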