Integrating and Analyzing Databases and Interrelated Documents
The velocity, variety, and volume of present-day data pose new problems that Database Management Systems (DBMS) must handle. In particular, text data epitomize so-called ``unstructured data'' and call for efficient in-database algorithms and data structures for their analysis. Multiple solutions have been proposed for preprocessing, integrating, and analyzing heterogeneous sources via ad-hoc systems. This dissertation defends the idea that text corpora can be managed efficiently inside a relational DBMS via SQL and database extensibility mechanisms. It presents data layouts and algorithms for preprocessing, storing, and querying text data efficiently within a DBMS. The optimizations focus on one-pass algorithms, performing computations in memory, and reducing the number of I/O operations. Furthermore, the DBDOC project introduces a new algorithm and properties for integrating and querying structured information stored in a relational DBMS together with unstructured data that lives outside the DBMS. Complementing this, an original query recommendation algorithm based on OLAP cubes helps the querying system find related concepts. The last chapter of this dissertation explores the analysis of heterogeneous data in the form of text corpora stored within a DBMS. This research yields two original OLAP-based algorithms for extracting knowledge from text corpora: ONTOCUBE, which extracts ontologies using OLAP cubes, and CUBO, which generates OLAP cubes from ontologies. The distinguishing trait of both algorithms is that they exploit the sparse nature of text data and perform efficient frequency summarizations. In addition, ONTOCUBE presents new measurements for building ontologies using OLAP cubes, and CUBO formalizes the notation for mapping the hierarchy behind an ontology in order to compute multidimensional aggregations on classified documents.
Finally, all of these document exploration algorithms are refined and adapted to explore source code files that reference a database schema. This dissertation concludes with important open problems.