Motion Histograms for Action Recognition Using a Convolutional Neural Network
Chotard, David Loic 1992-
Recognizing human actions in video is an important research subject in computer vision. Action recognition systems are particularly useful in video surveillance for detecting potential felonies, and they also find applications in other domains such as patient monitoring, video summarization, and video analysis. Even modern video games use action recognition algorithms to make the player more active. The classification of human actions, however, remains a challenging problem because of the large variation in imaging conditions and in the individual attributes of the people performing the action. One of the main difficulties in designing such systems is that the concept of action is closely related to that of motion, which is a human impression. Indeed, considering frames independently and disregarding motion is not robust enough, because different actions may look similar at the frame level. In this thesis, we use a convolutional neural network that classifies actions while taking the motion information into account. Starting from a dataset of labeled actions, we compute the optical flow field, which quantifies the motion between pairs of frames. Based on the flow orientations, we then build a histogram for each action, which yields a low-dimensional representation. Actions are thus described as orientation distributions directly related to the motion. Finally, we use the histograms to train a convolutional neural network that can extract low-level features to increase the classification accuracy. We present results of our method on two benchmark datasets, achieving 88.8% accuracy on the UCF Sports dataset, which consists of 13 reference sport actions, and up to 35.7% on the HMDB51 dataset, which contains 51 more complex actions and deliberately includes ambiguities and mistakes. In both cases we outperform numerous methods and come close to state-of-the-art algorithms.
Our method encodes discriminative features and makes action recognition independent of the length of the video. It constitutes a good alternative to the non-linear classifiers commonly used for this task, such as support vector machines.
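As a rough illustration of the histogram step described above, the following Python sketch bins the orientations of precomputed optical flow fields and normalizes the result so that it does not depend on the number of frames. It is not the thesis implementation: the function names, the choice of 8 bins, the magnitude weighting, and the magnitude threshold are all assumptions made for the example, and each flow field is assumed to be available as an array of shape (H, W, 2).

```python
import numpy as np


def orientation_histogram(flow, n_bins=8, min_magnitude=1e-3):
    """Histogram of flow orientations for one flow field.

    flow: array of shape (H, W, 2) holding the horizontal and vertical
    displacement of every pixel (assumed layout, not from the thesis).
    """
    dx, dy = flow[..., 0], flow[..., 1]
    magnitude = np.hypot(dx, dy)
    # Map angles from (-pi, pi] to [0, 2*pi) so the bins cover a full turn.
    angle = np.arctan2(dy, dx) % (2 * np.pi)
    # Ignore pixels that barely move; their orientation is mostly noise.
    moving = magnitude > min_magnitude
    hist, _ = np.histogram(
        angle[moving],
        bins=n_bins,
        range=(0.0, 2 * np.pi),
        weights=magnitude[moving],  # assumption: weight votes by flow magnitude
    )
    return hist


def action_histogram(flows, n_bins=8):
    """Accumulate per-field histograms over a clip and normalize to sum to 1."""
    total = np.zeros(n_bins)
    for flow in flows:
        total += orientation_histogram(flow, n_bins)
    norm = total.sum()
    # A clip with no motion yields an all-zero histogram.
    return total / norm if norm > 0 else total


if __name__ == "__main__":
    # Synthetic clip in which every pixel moves to the right.
    rightward = np.zeros((4, 4, 2))
    rightward[..., 0] = 1.0
    print(action_histogram([rightward, rightward]))
```

Because the accumulated histogram is normalized, a clip of 10 flow fields and a clip of 100 flow fields with the same motion pattern produce the same descriptor, which is one simple way to obtain a fixed-size network input regardless of video length.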