
dc.contributor.advisor: Ordonez, Carlos
dc.creator: Garcia Alvarado, Carlos 1983-
dc.date.accessioned: 2014-03-13T21:29:21Z
dc.date.available: 2014-03-13T21:29:21Z
dc.date.created: December 2012
dc.date.issued: 2012-12
dc.identifier.uri: http://hdl.handle.net/10657/554
dc.description.abstract: The velocity, variety, and volume of present-day data bring about new problems that Database Management Systems (DBMS) must handle. In particular, text data encapsulate the essence of the so-called "unstructured data" and the need for efficient in-database algorithms and data structures for their analysis. Multiple solutions have been proposed for preprocessing, integrating, and analyzing heterogeneous sources via ad-hoc systems. This dissertation defends the idea that text corpora can be managed efficiently inside a relational database management system via SQL and database extensibility mechanisms. It presents data layouts and algorithms for preprocessing, storing, and querying text data efficiently within a DBMS. The optimizations focus on one-pass algorithms, pushing in-memory computations and reducing the number of I/Os. Furthermore, the DBDOC project introduces a new algorithm and properties for integrating and querying structured information stored in a relational database management system and the unstructured data living outside the DBMS realm. In a complementary manner, an original query recommendation algorithm based on OLAP cubes enhances the querying system for finding related concepts. The last chapter of this dissertation is based on exploring heterogeneous data analysis in the form of text corpora stored within a DBMS. The results of this research are a couple of original OLAP-based algorithms for extracting knowledge from text corpora. ONTOCUBE and CUBO focus on extracting and generating OLAP cubes from ontologies, respectively. The distinguishing trait of both algorithms is that they exploit the sparse nature of text data and perform efficient frequency summarizations. In addition, ONTOCUBE presents new measurements for building ontologies using OLAP cubes and CUBO formalizes the notation to map the hierarchy behind an ontology to compute multidimensional aggregations on classified documents. Finally, all of these document data exploration algorithms are then refined and adapted for exploring source code files that reference a database schema. This dissertation concludes with important open problems.
dc.format.mimetype: application/pdf
dc.language.iso: eng
dc.subject: DBMS
dc.subject: Documents
dc.subject: OLAP
dc.subject: Source Code
dc.subject.lcsh: Computer science
dc.title: Integrating and Analyzing Databases and Interrelated Documents
dc.date.updated: 2014-03-13T21:29:26Z
dc.type.genre: Thesis
thesis.degree.name: Doctor of Philosophy
thesis.degree.level: Doctoral
thesis.degree.discipline: Computer Science
thesis.degree.grantor: University of Houston
thesis.degree.department: Computer Science
dc.contributor.committeeMember: Subhlok, Jaspal
dc.contributor.committeeMember: Pâris, Jehan-François
dc.contributor.committeeMember: Gnawali, Omprakash
dc.contributor.committeeMember: Andrews, Richard
dc.type.dcmi: Text
dc.format.digitalOrigin: born digital
thesis.degree.major: Database Systems
dc.description.department: Computer Science
thesis.degree.college: College of Natural Sciences and Mathematics

