My research focuses on computer vision and machine learning for Visual Speech Recognition (VSR), which lies at the intersection of multiple modalities: video (speech videos), audio (speech audio), and text (natural language). I have also worked on image stylization for enabling cross-modal transfer of style. My goal is to develop robust and scalable solutions for real-world sensing problems using computer vision.
Prior to this, I spent one year (2015-16) as a research fellow at CVIT working on cross-modal multimedia retrieval under the supervision of Prof. Jawahar. Before moving to Hyderabad, I was a Manager (Planning) at Tata Steel Limited (2014-15), working on automation and energy-consumption optimization in a processing plant.
I graduated from IIT Dhanbad, India, in 2014 with a B.Tech in Electronics and Communication Engineering. During my undergraduate years, I worked closely with Prof. Mrinal Sen and Dr. Dilip Prasad on projects related to computer vision and robotics.
Accepted: Our paper “Towards Automatic Face-to-Face Translation” has been accepted at ACM Multimedia 2019.
Understanding videos of people speaking across international borders is hard, as audiences from different demographics do not understand the language. Such speech videos are often supplemented with subtitles, but these hamper the viewing experience because the viewer's attention is divided. Simple audio dubbing in a different language makes the video appear unnatural due to unsynchronized lip motion. In this paper, we propose a system for automated cross-language lip synchronization of re-dubbed videos. Our model generates photorealistic lip synchronization over the original video that is superior to the current re-dubbing approach. Through a user study, we verify that our method is preferred over unsynchronized videos.
We, humans, have the ability to easily imagine scenes that depict sentences such as “Today is a beautiful sunny day” or “There is a Christmas feel in the air”. While it is hard to precisely describe what one person may imagine, the essential high-level themes associated with such sentences largely remain the same. The ability to synthesize novel images that depict the feel of a sentence is very useful in a variety of applications such as education, advertisement, and entertainment. While existing papers tackle this problem given a style image, we aim to provide a far more intuitive and easy-to-use solution that synthesizes novel renditions of an existing image, conditioned on a given sentence. We present a method for cross-modal style transfer between an English sentence and an image, to produce a new image that imbibes the essential theme of the sentence. We do this by modifying the style transfer mechanism used in image style transfer to incorporate a style component derived from the given sentence. We demonstrate promising results using the YFCC100m dataset.
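The underlying mechanism is the Gram-matrix style loss from standard image style transfer, with the style target derived from the sentence instead of a style image. Below is a minimal sketch of that loss; the function names, the random stand-in feature maps, and the way the sentence-derived target is obtained are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def gram_matrix(features):
    """Gram matrix of a (channels, height*width) feature map,
    as used in Gatys-style image style transfer."""
    c, hw = features.shape
    return features @ features.T / (c * hw)

def style_loss(generated_feats, target_style_feats):
    """Mean squared difference between the Gram matrices of the
    generated image's features and the target style features."""
    g_gen = gram_matrix(generated_feats)
    g_style = gram_matrix(target_style_feats)
    return float(np.mean((g_gen - g_style) ** 2))

# Illustrative stand-ins for CNN feature maps (channels x pixels).
rng = np.random.default_rng(0)
generated = rng.standard_normal((64, 32 * 32))
# In the cross-modal setting, this target would be derived from the
# input sentence rather than from a reference style image.
sentence_style = rng.standard_normal((64, 32 * 32))
print(style_loss(generated, sentence_style))
```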
Our goal is to spot words in silent speech videos without explicitly recognizing the spoken words, where the lip motion of the speaker is clearly visible and audio is absent. Existing work in this domain has mainly focused on recognizing a fixed set of words in word-segmented lip videos, which limits the applicability of the learned model due to the limited vocabulary and a high dependency on the model's recognition performance. Our contribution is two-fold: 1) we develop a pipeline for recognition-free retrieval and show its performance against recognition-based retrieval on a large-scale dataset and on another set of out-of-vocabulary words; 2) we introduce a query expansion technique using pseudo-relevance feedback and propose a novel re-ranking method based on maximizing the correlation between spatio-temporal landmarks of the query and the top retrieval candidates. Our word spotting method achieves 35% higher mean average precision than the recognition-based method on the large-scale LRW dataset. Finally, we demonstrate the application of the method by spotting words in a popular speech video (“ ” by Charlie Chaplin), where we show that word retrieval can be used to understand what was spoken, perhaps even in silent movies.
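As a rough illustration of recognition-free retrieval with pseudo-relevance-feedback query expansion: rank gallery clips by similarity to the query feature, assume the top few are relevant, and fold them back into the query. The feature dimensions, cosine similarity, and the simple averaging rule below are assumptions for the sketch, not the exact formulation in the paper (which also re-ranks using spatio-temporal landmark correlation).

```python
import numpy as np

def retrieve(query, gallery, top_k=None):
    """Rank gallery items by cosine similarity to the query feature."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    order = np.argsort(-(g @ q))
    return order if top_k is None else order[:top_k]

def expand_query(query, gallery, k=5):
    """Pseudo-relevance feedback: treat the top-k retrievals as relevant
    and average their features back into the original query."""
    top = retrieve(query, gallery, top_k=k)
    return (query + gallery[top].mean(axis=0)) / 2.0

# Illustrative features: one query word and a gallery of lip-video clips.
rng = np.random.default_rng(1)
gallery = rng.standard_normal((1000, 256))
query = rng.standard_normal(256)
ranking = retrieve(expand_query(query, gallery), gallery, top_k=10)
print(ranking)
```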
While dealing with multi-modal data such as pairs of images and text, though individual samples may demonstrate inherent heterogeneity in their content, they are usually coupled with each other through some higher-level concepts such as their categories. This shared information can be useful in measuring the semantics of samples across modalities in a relative manner. In this paper, we investigate the problem of analysing the degree of specificity in the semantic content of a sample in one modality with respect to semantically similar samples in another modality. Samples that have high similarity with semantically similar samples from another modality are considered specific, while others are considered relatively ambiguous. To model this property, we propose a novel notion of “cross-specificity”. We present two mechanisms to measure cross-specificity: one based on human judgement and the other based on an automated approach. We analyse different aspects of cross-specificity and demonstrate its utility in the cross-modal retrieval task. Experiments show that, though conceptually simple, it can benefit several existing cross-modal retrieval techniques and provide a significant boost in their performance.
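A toy sketch of the automated flavour of such a measure is given below: score a sample by how strongly it agrees, on average, with semantically similar samples from the other modality. The shared embedding space, the cosine-similarity aggregation, and the function name are assumptions for illustration; the paper defines the actual measure.

```python
import numpy as np

def cross_specificity(image_feat, peer_text_feats):
    """Score an image by its mean cosine similarity to text features of
    samples sharing its high-level category: a high score marks the
    sample as specific, a low score as relatively ambiguous."""
    img = image_feat / np.linalg.norm(image_feat)
    txt = peer_text_feats / np.linalg.norm(peer_text_feats, axis=1, keepdims=True)
    return float((txt @ img).mean())

# Illustrative embeddings: one image and the texts of its category peers.
rng = np.random.default_rng(2)
image_feat = rng.standard_normal(128)
peer_texts = rng.standard_normal((20, 128))
print(cross_specificity(image_feat, peer_texts))
```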
Teaching assistant (TA) for the course Topics in Machine Learning (CSE975). Course instructor: Prof. Naresh Manwani.
Mentor in the 1st Foundations course on Artificial Intelligence and Machine Learning. Course instructor: Prof. C. V. Jawahar.
Organizing Team: 17th R&D Showcase 2018, IIIT Hyderabad: a showcase of exhibits and demonstrations of research projects, representing IIIT-H's most recent developments in research and technological innovation.