Transforming Point Data into Proximity Graphs
Subramanyam Varalakshmi, Arjun 1991-
MetadataShow full item record
Establishing neighborhood relationships among data points is important for several data analysis applications, for which structural information is essential. Since the early 1980’s, proximity graphs have served as one of the classical approaches to characterize neighborhood relationships; the most popular proximity graphs include Delaunay Triangulation (DT), Gabriel graphs (GG), relative neighborhood graphs (RNG), and minimum spanning tree (MST). These graphs find their applications in geographical analysis, spatial analysis, pattern recognition, evolutionary biology, computer vision, cluster analysis, and visualization. The emergence of big data has created a need for scalable algorithms to generate proximity graphs for massive datasets. In this thesis, we propose a novel approach for creating DT, GG, and RNG by leveraging the cluster computing capabilities of Apache Spark. To compute DT, we rely on the divide and conquer paradigm. We first partition the spatial dataset into a grid; next, we compute the DT for each grid partition and separate the resulting triangles into the core and boundary triangles; next, we merge the adjacent partitions recursively and re-compute boundary triangles; finally, we return the union of obtained triangles as the final result. GG and RNG are implemented on top of a DT, by removing edges from the DT. In our GG algorithm, we compute the circumcenter for every triangle and associate it with edges of the triangle. The circumcenter coordinates associated with the non-boundary edges form the Voronoi edge. We filter those Delaunay edges, which do not intersect with their corresponding Voronoi edge, to obtain a GG. Relative neighborhood graphs are formed by removing those edges, for which the distance between the endpoints of an edge is less than the distance between any other point and one of the endpoints of the edge. We evaluate our algorithms to create DT, GG, and RNG for a benchmark of datasets having between one and eight million data records. Our benchmark tests showcase the significant performance improvement obtained with our algorithms and the impact of the grid partitioning technique on the runtime. We apply our algorithms to create clusters of Texas climate data and for boundary detection on a genetic dataset.