Yafeng Yin’s research group has recently made new advances in human activity recognition and understanding. The group has proposed a lip-reading technique based on acoustic sensing and self-distillation for silent speech interaction, and a sign language recognition and translation technique based on contrastive learning to assist communication for deaf people.
1. Acoustic-Based Lip Reading for Mobile Devices: Dataset, Benchmark and a Self Distillation-Based Approach.
Speech is a natural way for people to communicate and a convenient modality for human-computer interaction. However, audible speech has several problems: it is affected by surrounding noise, it disturbs quiet environments, and it can leak private information. Silent speech was proposed in response, in particular lip reading, which aims to recognize speech content from lip movements. In this work, they use inaudible acoustic signals emitted by a mobile device to sense and recognize lip movements for lip reading. First, given the lack of public datasets for acoustic-based lip reading, they build and release LIPCMD, a large-scale lip-reading dataset with 30,000 acoustic recordings. Second, to support further research in lip reading, they provide a benchmark evaluation on LIPCMD. Third, to recognize weak acoustic signals as words, they propose LipReader, a self-distillation-based approach that distills the probability distribution and the attention map within the convolutional neural network itself to improve classification. Finally, they implement LipReader on a smartphone and evaluate it on the LIPCMD dataset as well as in complex scenarios. This work has been accepted for publication in IEEE Transactions on Mobile Computing (CCF-A journal). Academic peers interested in this research are welcome to contact us for further discussion: yafeng@nju.edu.cn.
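The self-distillation idea described above, in which a network learns from its own probability distribution and attention map, can be illustrated with a minimal sketch. This is not LipReader's actual loss: treating the deepest classifier as the teacher and a shallower branch as the student, the squared-activation attention map, and the names `self_distillation_loss`, `temperature`, `alpha`, and `beta` are all assumptions made for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax, shifted by the max for numerical stability.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    # Standard classification loss against the ground-truth word label.
    return -math.log(softmax(logits)[label])

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) between two discrete distributions.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def attention_map(feature_maps):
    # feature_maps: list of channels, each a flat list of spatial activations.
    # Assumed definition: sum of squared activations over channels, L2-normalized.
    n = len(feature_maps[0])
    energy = [sum(ch[i] ** 2 for ch in feature_maps) for i in range(n)]
    norm = math.sqrt(sum(e * e for e in energy)) or 1.0
    return [e / norm for e in energy]

def self_distillation_loss(deep_logits, shallow_logits, deep_feats, shallow_feats,
                           label, temperature=2.0, alpha=0.5, beta=0.1):
    # Hypothetical combination: the deepest classifier acts as teacher for a
    # shallower branch of the same network (both feature maps are assumed to
    # share one spatial size).
    ce = cross_entropy(shallow_logits, label)
    teacher = softmax(deep_logits, temperature)
    student = softmax(shallow_logits, temperature)
    kd = kl_divergence(teacher, student)  # probability-distribution distillation
    a_t = attention_map(deep_feats)
    a_s = attention_map(shallow_feats)
    at = sum((x - y) ** 2 for x, y in zip(a_t, a_s))  # attention-map distillation
    return ce + alpha * kd + beta * at

# Toy example: 3 word classes, 2 channels, 4 spatial positions.
loss = self_distillation_loss(
    deep_logits=[2.0, 0.5, -1.0],
    shallow_logits=[0.2, 0.1, 0.0],
    deep_feats=[[1.0, 0.0, 0.5, 0.2], [0.3, 0.1, 0.9, 0.0]],
    shallow_feats=[[0.2, 0.4, 0.1, 0.3], [0.1, 0.2, 0.2, 0.1]],
    label=0,
)
```

When the shallow branch already matches the deep one, both distillation terms vanish and the loss reduces to plain cross-entropy, so in this sketch the extra terms only add a penalty where the two branches disagree.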
2. Contrastive Learning for Sign Language Recognition and Translation.
Two problems are widespread in current end-to-end sign language processing architectures. One is the CTC (connectionist temporal classification) spike phenomenon, which weakens visual representation ability in Continuous Sign Language Recognition (CSLR). The other is the exposure bias problem, which causes translation errors to accumulate during inference in Sign Language Translation (SLT). In this work, they tackle both issues by introducing contrastive learning, aiming to enhance visual-level feature representation and semantic-level error tolerance. Specifically, to alleviate the CTC spike phenomenon and enhance visual-level representation, they design a visual contrastive loss that minimizes the visual feature distance between different augmented samples of frames in one sign video, so that the model can further explore features from numerous unlabeled frames in an unsupervised way. To alleviate the exposure bias problem and improve semantic-level error tolerance, they design a semantic contrastive loss that re-inputs the predicted sentence into the semantic module and compares the features of the ground-truth sequence and the predicted sequence, exposing the model to its own mistakes. Finally, they conduct extensive experiments on current sign language datasets to demonstrate the effectiveness of their approach. This work has been accepted for presentation at the 32nd International Joint Conference on Artificial Intelligence (IJCAI 2023) (CCF-A conference). Academic peers interested in this research are welcome to contact us for further discussion: yafeng@nju.edu.cn.
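The two contrastive losses described above can be sketched with a generic InfoNCE-style objective. This is a common contrastive formulation swapped in for illustration, not necessarily the loss used in the paper: the function name `info_nce`, the cosine similarity, the `temperature` value, and the use of other frames or sentences in the batch as negatives are all assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b, eps=1e-12):
    # Cosine similarity between two feature vectors.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + eps)

def info_nce(anchors, candidates, temperature=0.1):
    # anchors[i] and candidates[i] form a positive pair; every other
    # candidates[j] is treated as a negative for anchors[i].
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, c) / temperature for c in candidates]
        m = max(logits)
        log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
        losses.append(log_sum - logits[i])  # -log softmax of the positive pair
    return sum(losses) / len(losses)

# Visual level (assumed setup): features of two augmentations of the same
# frames from one sign video; no gloss labels are needed.
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[0.9, 0.1], [0.1, 0.9]]
visual_loss = info_nce(view_a, view_b)

# Semantic level (assumed setup): features of ground-truth sentences paired
# with features obtained by re-inputting the model's own predicted sentences.
gt_feats = [[0.8, 0.2, 0.0], [0.0, 0.3, 0.7]]
pred_feats = [[0.7, 0.3, 0.1], [0.1, 0.2, 0.8]]
semantic_loss = info_nce(gt_feats, pred_feats)
```

In this sketch the loss is small when each anchor is closest to its own positive and grows when the pairs are mismatched, which is the behavior both the visual and the semantic objective rely on.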