Supervisors: Dr. Zhiwu Huang and Dr. Danda Pani Paudel
The growing use of commodity RGB-D sensors has led to an abundance of multi-modal data. For example, the Microsoft Kinect simultaneously provides RGB video, depth sequences, and human skeleton information. Most existing action recognition techniques focus on a single modality, building their classifiers only on features extracted from it. For better activity recognition, it is highly desirable to study the effectiveness of jointly leveraging multi-modal information. With this motivation, this thesis explores whether multiple cues can improve the accuracy of human action recognition. In particular, we introduce a multi-stream network combining a Parts-aware LSTM on 2D position information, a novel LieLSTM method for learning Lie group representations of 3D skeletal data, and a state-of-the-art Temporal Segment Network on RGB videos. Evaluation on one of the most popular RGB-D benchmarks, NTU RGB+D, shows that the proposed multi-stream network considerably outperforms its single-modality counterparts while achieving performance comparable to other state-of-the-art methods.
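The abstract does not specify how the three streams are combined; a common choice for such architectures is weighted late fusion of per-stream class scores. The sketch below illustrates that idea only: the stream names, random scores, and equal fusion weights are illustrative assumptions, not the thesis' actual configuration.

```python
import numpy as np

# NTU RGB+D has 60 action classes.
num_classes = 60

rng = np.random.default_rng(0)

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical per-stream class scores for a single clip (stand-ins for the
# outputs of the three streams described in the abstract).
scores_plstm   = softmax(rng.normal(size=num_classes))  # 2D-skeleton stream
scores_lielstm = softmax(rng.normal(size=num_classes))  # 3D Lie-group stream
scores_tsn     = softmax(rng.normal(size=num_classes))  # RGB video stream

# Weighted late fusion: average the stream scores, then take the arg-max class.
weights = np.array([1.0, 1.0, 1.0]) / 3.0  # assumed equal weights
fused = (weights[0] * scores_plstm
         + weights[1] * scores_lielstm
         + weights[2] * scores_tsn)
predicted_class = int(np.argmax(fused))
```

Because each stream's scores already sum to one, the weighted average is again a valid probability distribution over the 60 classes, and the fused prediction can differ from any single stream's top choice.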