Fu, Xin
2022-06-30
August 2022
2022-08

Portions of this document appear in: Zhang, Xingyao, et al. "Towards Memory Friendly Long-Short Term Memory Networks (LSTMs) on Mobile GPUs." 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2018; and in: Zhang, Xingyao, et al. "Enabling Highly Efficient Capsule Networks Processing Through a PIM-Based Architecture Design." 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2020.

https://hdl.handle.net/10657/10256

In recent years, neural networks have achieved great success in many areas, e.g., autonomous driving, medicine, and Intelligent Personal Assistants (IPAs). Among neural network models, the Long Short-Term Memory network (LSTM) and the Capsule Network (CapsNet) are popular but execute inefficiently on hardware devices. In this dissertation, I introduce two hardware-software co-design approaches for efficiently executing the inference stage of the LSTM and the CapsNet. In the first work, we observe that LSTMs exhibit highly inefficient memory access patterns when executed on mobile GPUs, due to redundant data movement and limited off-chip bandwidth. To address the redundancy, we propose inter-cell-level optimizations that improve data locality across cells with negligible accuracy loss. To relieve the pressure on the limited off-chip memory bandwidth, we propose intra-cell-level optimizations that dynamically skip the loads and computations of weight-matrix rows whose contribution to the outputs is trivial. We also introduce a lightweight module into the GPU architecture to perform this runtime row skipping in the weight matrices. In the second work, we observe that CapsNet execution suffers low efficiency due to the execution features of its routing procedure, including massive unshareable intermediate variables and intensive synchronization. We propose software-hardware co-designed optimizations, SH-CapsNet, which comprise software-level optimizations named S-CapsNet and a hybrid computing architecture design named PIM-CapsNet. At the software level, S-CapsNet reduces computation and memory accesses by exploiting the computational redundancy and data similarity of the routing procedure. At the hardware level, PIM-CapsNet leverages the processing-in-memory capability of today's 3D-stacked memory to provide an off-chip, in-memory acceleration solution for the routing procedure, pipelined with the GPU's on-chip computing capability that accelerates the CNN-type layers in CapsNet. (Illustrative sketches of the row-skipping idea and the routing procedure follow this record.)

application/pdf
eng

The author of this work is the copyright owner. UH Libraries and the Texas Digital Library have their permission to store and provide access to this work. UH Libraries has secured permission to reproduce any and all previously published materials contained in the work. Further transmission, reproduction, or presentation of this work is prohibited except with permission of the author(s).

Computer Architecture
Machine Learning Acceleration
Emerging Technology
Processing in Memory

Enabling Efficient Neural Network Computation via Hardware and Software Co-Design
2022-06-30
Thesis
born digital
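The intra-cell row-skipping optimization from the first work can be pictured with a minimal NumPy sketch. Everything here is illustrative: the function name `gate_matvec_with_row_skipping`, the threshold value, and the use of an L1-norm upper bound as the skipping criterion are assumptions made for exposition, not the dissertation's actual runtime mechanism (which is implemented as a lightweight hardware module on the GPU).

```python
import numpy as np

def gate_matvec_with_row_skipping(W, x, b, threshold=1e-2):
    """Compute y = W @ x + b for one LSTM gate, skipping rows of W
    whose contribution to the output is estimated to be trivial.

    Hypothetical skipping criterion: |W[i] @ x| <= ||W[i]||_1 * ||x||_inf
    (Hoelder's inequality), so a row whose bound falls below `threshold`
    is skipped. On real hardware the payoff comes from also skipping the
    off-chip load of W[i], not just the arithmetic.
    """
    y = b.copy()
    x_scale = np.max(np.abs(x))
    for i in range(W.shape[0]):
        # Upper-bound the row's contribution; skip load + compute if trivial.
        if np.sum(np.abs(W[i])) * x_scale < threshold:
            continue
        y[i] += W[i] @ x
    return y

# Toy usage: small weights make many rows fall below the bound.
W = np.random.randn(256, 128) * 0.01
x = np.random.randn(128)
y = gate_matvec_with_row_skipping(W, x, np.zeros(256))
```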
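The routing procedure that the second work targets is the routing-by-agreement algorithm of Sabour et al. (2017), on which CapsNet is built. The NumPy sketch below is a textbook rendition, not the dissertation's optimized version: the per-pair logits `b` and coefficients `c` are the unshareable intermediate variables, and the per-iteration softmax and reduction steps are the synchronization points the abstract refers to. Shapes and the three-iteration default follow the original CapsNet paper.

```python
import numpy as np

def squash(s, axis=-1):
    """Squashing nonlinearity from Sabour et al. (2017)."""
    norm2 = np.sum(s * s, axis=axis, keepdims=True)
    return (norm2 / (1.0 + norm2)) * s / np.sqrt(norm2 + 1e-9)

def dynamic_routing(u_hat, num_iters=3):
    """Routing-by-agreement between I lower and J upper capsules.

    u_hat: prediction vectors of shape (I, J, D).
    Each iteration allocates per-pair state (b, c) that cannot be
    shared across inputs, and the softmax / weighted-sum steps are
    global synchronization points -- the efficiency bottleneck the
    dissertation's S-CapsNet and PIM-CapsNet designs attack.
    """
    I, J, D = u_hat.shape
    b = np.zeros((I, J))                                      # routing logits
    for _ in range(num_iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # softmax over J (sync)
        s = np.einsum('ij,ijd->jd', c, u_hat)                 # weighted sum (sync)
        v = squash(s)                                          # upper-capsule outputs
        b = b + np.einsum('ijd,jd->ij', u_hat, v)             # agreement update
    return v

# Toy usage: 32 lower capsules routing to 10 upper capsules of dim 16.
v = dynamic_routing(np.random.randn(32, 10, 16))
```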