Integrating and Analyzing Databases and Interrelated Documents

Garcia Alvarado, Carlos 1983-

dc.contributor.advisor	Ordonez, Carlos
dc.contributor.committeeMember	Subhlok, Jaspal
dc.contributor.committeeMember	Pâris, Jehan-François
dc.contributor.committeeMember	Gnawali, Omprakash
dc.contributor.committeeMember	Andrews, Richard
dc.creator	Garcia Alvarado, Carlos 1983-
dc.date.accessioned	2014-03-13T21:29:21Z
dc.date.available	2014-03-13T21:29:21Z
dc.date.created	December 2012
dc.date.issued	2012-12
dc.date.updated	2014-03-13T21:29:26Z
dc.description.abstract	The velocity, variety, and volume of present-day data bring about new problems that Database Management Systems (DBMS) must handle. In particular, text data encapsulate the essence of so-called "unstructured data" and the need for efficient in-database algorithms and data structures for their analysis. Multiple solutions have been proposed for preprocessing, integrating, and analyzing heterogeneous sources via ad-hoc systems. This dissertation defends the idea that text corpora can be managed efficiently inside a relational database management system via SQL and database extensibility mechanisms. It presents data layouts and algorithms for preprocessing, storing, and querying text data efficiently within a DBMS. The optimizations focus on one-pass algorithms, pushing in-memory computations, and reducing the number of I/Os. Furthermore, the DBDOC project introduces a new algorithm and properties for integrating and querying structured information stored in a relational database management system together with the unstructured data living outside the DBMS realm. In a complementary manner, an original query recommendation algorithm based on OLAP cubes enhances the querying system for finding related concepts. The last chapter of this dissertation explores the analysis of heterogeneous data in the form of text corpora stored within a DBMS. The results of this research are two original OLAP-based algorithms for extracting knowledge from text corpora: ONTOCUBE, which focuses on extracting ontologies using OLAP cubes, and CUBO, which focuses on generating OLAP cubes from ontologies. The distinguishing trait of both algorithms is that they exploit the sparse nature of text data and perform efficient frequency summarizations. In addition, ONTOCUBE presents new measurements for building ontologies using OLAP cubes, and CUBO formalizes the notation for mapping the hierarchy behind an ontology in order to compute multidimensional aggregations on classified documents.
Finally, all of these document data exploration algorithms are then refined and adapted for exploring source code files that reference a database schema. This dissertation concludes with important open problems.
dc.description.department	Computer Science, Department of
dc.format.digitalOrigin	born digital
dc.format.mimetype	application/pdf
dc.identifier.uri	http://hdl.handle.net/10657/554
dc.language.iso	eng
dc.rights	The author of this work is the copyright owner. UH Libraries and the Texas Digital Library have their permission to store and provide access to this work. Further transmission, reproduction, or presentation of this work is prohibited except with permission of the author(s).
dc.subject	DBMS
dc.subject	Documents
dc.subject	OLAP
dc.subject	Source Code
dc.subject.lcsh	Computer science
dc.title	Integrating and Analyzing Databases and Interrelated Documents
dc.type.dcmi	Text
dc.type.genre	Thesis
thesis.degree.college	College of Natural Sciences and Mathematics
thesis.degree.department	Computer Science, Department of
thesis.degree.discipline	Computer Science
thesis.degree.grantor	University of Houston
thesis.degree.level	Doctoral
thesis.degree.major	Database Systems
thesis.degree.name	Doctor of Philosophy

Files

Original bundle

Name: cgarcia-disseration-version2.pdf
Size: 1.11 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 2.13 KB
Format: Plain Text

Collections

Published ETD Collection