Matrix Computations on TensorCore GPU






The emergence of neural engines such as the Nvidia TensorCore GPU has revolutionized deep neural networks, because these engines perform general matrix multiplications extremely fast. However, how to deploy other algorithms and applications on neural engines remains an open question. In this dissertation, I explore using TensorCore GPUs to accelerate BLAS3 operations, linear algebra algorithms, and machine learning algorithms on GPUs, hybrid CPU-GPU architectures, and distributed systems. Specifically, I design TensorCore-based matrix computation algorithms that work on different architectures. My work on a single GPU includes implementing several basic linear algebra operations, which serve as building blocks for matrix factorization. For matrix factorization, I develop a recursive QR factorization that uses the TensorCore GPU efficiently. I also explore using TensorCore to accelerate the two-stage eigenvalue decomposition. In addition, I migrate a scalable CPU-based support vector machine tool to TensorCore; the result exhibits a significant speedup and outperforms state-of-the-art GPU-based SVM software. On the hybrid CPU-GPU architecture, I go a step further by investigating the recursive strategy and conducting a case study of out-of-core QR factorization, and the results show that the recursive algorithm performs much better than the conventional algorithm. On distributed memory systems, my current work is developing a data structure named Universal Distributed Array (UDA), which offers excellent programming flexibility and can also utilize TensorCore. In general, TensorCore-based algorithms achieve very high performance, but they must contend with the accuracy loss caused by half-precision arithmetic.
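The accuracy loss noted at the end of the abstract can be illustrated with a minimal sketch. It is not taken from the dissertation and does not use a GPU. It assumes a simplified TensorCore model in which matrix entries are rounded to IEEE 754 half precision before multiplication and the products are accumulated at higher precision. Python's double-precision floats stand in for the accumulator here. The helper names `to_half`, `matmul`, and `max_rel_error` are hypothetical.

```python
import struct


def to_half(x: float) -> float:
    """Round x to the nearest IEEE 754 binary16 value, returned as a Python float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def matmul(a, b, round_inputs=False):
    """Triple-loop matrix product of two lists of lists.

    With round_inputs=True, every entry of a and b is first rounded to half
    precision (assumed stand-in for TensorCore input format). Accumulation
    always happens in Python floats (double precision), which is an
    assumption of this sketch rather than the hardware's actual accumulator.
    """
    if round_inputs:
        a = [[to_half(v) for v in row] for row in a]
        b = [[to_half(v) for v in row] for row in b]
    inner, cols = len(b), len(b[0])
    return [
        [sum(row[p] * b[p][j] for p in range(inner)) for j in range(cols)]
        for row in a
    ]


def max_rel_error(reference, approx):
    """Largest entrywise relative error of approx with respect to reference."""
    return max(
        abs(r - x) / abs(r)
        for ref_row, row in zip(reference, approx)
        for r, x in zip(ref_row, row)
    )


# Illustrative input: an 8 x 8 Hilbert matrix, whose entries 1/(i+j+1)
# are mostly not exactly representable in half precision.
n = 8
hilbert = [[1.0 / (i + j + 1) for j in range(n)] for i in range(n)]

reference = matmul(hilbert, hilbert)                  # double-precision inputs
emulated = matmul(hilbert, hilbert, round_inputs=True)  # half-precision inputs
error = max_rel_error(reference, emulated)
print(f"max relative error with half-precision inputs: {error:.2e}")
```

Even with high-precision accumulation, rounding the inputs alone leaves a relative error on the order of the half-precision unit roundoff (about 2^-11), which is why TensorCore-based factorizations need extra care to recover accuracy.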



TensorCore, Mixed-precision Computation, Numerical Linear Algebra, Kernel Machine, Distributed Computing, GPGPU


Portions of this document appear in: Zhang, Shaoshuai, Vivek Karihaloo, and Panruo Wu. "Basic Linear Algebra Operations on TensorCore GPU." In 2020 IEEE/ACM 11th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA), pp. 44-52. IEEE, 2020; and in: Zhang, Shaoshuai, Elaheh Baharlouei, and Panruo Wu. "High accuracy matrix computations on neural engines: A study of QR factorization and its applications." In Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing, pp. 17-28. 2020; and in: Zhang, Shaoshuai, and Panruo Wu. "Recursion Brings Speedup to Out-of-Core TensorCore-based Linear Algebra Algorithms: A Case Study of Classic Gram-Schmidt QR Factorization." In 50th International Conference on Parallel Processing, pp. 1-11. 2021; and in: Zhang, Shaoshuai, Ruchi Shah, and Panruo Wu. "TensorSVM: accelerating kernel machines with tensor engine." In Proceedings of the 34th ACM International Conference on Supercomputing, pp. 1-11. 2020.