dc.contributor.advisor Ordonez, Carlos
dc.creator Garcia Alvarado, Carlos 1983-
dc.date.accessioned 2014-03-13T21:29:21Z
dc.date.available 2014-03-13T21:29:21Z
dc.date.created December 2012
dc.date.issued 2012-12
dc.identifier.uri http://hdl.handle.net/10657/554
dc.description.abstract The velocity, variety, and volume of present-day data bring about new problems that Database Management Systems (DBMS) must handle. In particular, text data encapsulate the essence of the so-called "unstructured data" and the need for efficient in-database algorithms and data structures for their analysis. Multiple solutions have been proposed for preprocessing, integrating, and analyzing heterogeneous sources via ad-hoc systems. This dissertation defends the idea that text corpora can be managed efficiently inside a relational database management system via SQL and database extensibility mechanisms. It presents data layouts and algorithms for preprocessing, storing, and querying text data efficiently within a DBMS. The optimizations focus on one-pass algorithms, pushing in-memory computations, and reducing the number of I/Os. Furthermore, the DBDOC project introduces a new algorithm and properties for integrating and querying structured information stored in a relational database management system together with the unstructured data living outside the DBMS realm. In a complementary manner, an original query recommendation algorithm based on OLAP cubes enhances the querying system for finding related concepts. The last chapter of this dissertation explores heterogeneous data analysis in the form of text corpora stored within a DBMS. The results of this research are two original OLAP-based algorithms for extracting knowledge from text corpora: ONTOCUBE, which focuses on extracting ontologies, and CUBO, which focuses on generating OLAP cubes from ontologies. The distinguishing trait of both algorithms is that they exploit the sparse nature of text data and perform efficient frequency summarizations. In addition, ONTOCUBE presents new measurements for building ontologies using OLAP cubes, and CUBO formalizes the notation for mapping the hierarchy behind an ontology in order to compute multidimensional aggregations on classified documents. Finally, all of these document data exploration algorithms are refined and adapted for exploring source code files that reference a database schema. This dissertation concludes with important open problems.
dc.format.mimetype application/pdf
dc.language.iso eng
dc.rights The author of this work is the copyright owner. UH Libraries and the Texas Digital Library have their permission to store and provide access to this work. Further transmission, reproduction, or presentation of this work is prohibited except with permission of the author(s).
dc.subject DBMS
dc.subject Documents
dc.subject OLAP
dc.subject Source Code
dc.subject.lcsh Computer science
dc.title Integrating and Analyzing Databases and Interrelated Documents
dc.date.updated 2014-03-13T21:29:26Z
dc.type.genre Thesis
thesis.degree.name Doctor of Philosophy
thesis.degree.level Doctoral
thesis.degree.discipline Computer Science
thesis.degree.grantor University of Houston
thesis.degree.department Computer Science, Department of
dc.contributor.committeeMember Subhlok, Jaspal
dc.contributor.committeeMember Pâris, Jehan-François
dc.contributor.committeeMember Gnawali, Omprakash
dc.contributor.committeeMember Andrews, Richard
dc.type.dcmi Text
dc.format.digitalOrigin born digital
thesis.degree.major Database Systems
dc.description.department Computer Science, Department of
thesis.degree.college College of Natural Sciences and Mathematics
