Abhishek Jha
Doctoral Student, VISICS
Departement Elektrotechniek
Katholieke Universiteit Leuven
Email: abhishek [dot] jha [at] esat.kuleuven.be


I am a doctoral student at KU Leuven, Belgium, advised by Prof. Tinne Tuytelaars in the PSI division of the Departement Elektrotechniek (ESAT). My PhD research focuses on multimodal representation learning, with an emphasis on robust and interpretable representations.

Prior to joining the doctoral school, I spent a brief time at IISc Bangalore as a Project Assistant at MALL Lab, working with Dr. Partha Pratim Talukdar, Dr. Anirban Chakraborty and Dr. Anand Mishra on weakly supervised video understanding using Knowledge Graphs (KG).

I completed my Masters (MS) at IIIT Hyderabad, where I was jointly advised by Prof. C. V. Jawahar and Prof. Vinay P. Namboodiri at the Center for Visual Information Technology (CVIT). My master's research focused on computer vision and machine learning for Visual Speech Recognition (VSR), which lies at the intersection of multiple modalities: video (speech videos), audio (speech audio), and text (natural language). I have also worked on image stylization for enabling cross-modal transfer of style.

Prior to this, I spent one year (2015-16) as a research fellow at CVIT working on cross-modal multimedia retrieval under the supervision of Prof. Jawahar. Before moving to Hyderabad, I was a Manager (Planning) at Tata Steel Limited (2014-15), working on automation and energy consumption optimization in a processing plant.

I graduated from IIT Dhanbad, India, in 2014 with a B.Tech in Electronics and Communication Engineering. During my undergraduate years I worked closely with Prof. Mrinal Sen and Dr. Dilip Prasad on projects related to computer vision and robotics.


[Aug 2021] Our paper “Glimpse-Attend-and-Explore: Self-Attention for Active Visual Exploration” accepted at ICCV 2021.
[Jan 2021] Presented a poster on “Transferability of Self-Supervised Representations” at the Mediterranean Machine Learning summer school 2021.
[Sep 2020] Will be attending the Mediterranean Machine Learning summer school 2021 in January 2021.
[Aug 2020] Will be attending AI summer school 2020 (online), AI Singapore.
[Nov 2019] Joining KU Leuven, as a PhD student, at PSI ESAT.
[Jul 2019] Our paper “Towards Automatic Face-to-Face Translation” accepted at ACM Multimedia 2019.
[May 2019] Vikram presented our work “Cross-Language Speech Dependent Lip-Synchronization” at ICASSP 2019, Brighton, UK.
[Apr 2019] Presented my work on “Audio-Visual Speech Recognition and Synthesis” at MPI-Informatics, Saarbrücken.
[Apr 2019] Successfully defended my MS thesis, “Audio-Visual Speech Recognition and Synthesis”. [Thesis Link]
[Feb 2019] Our paper “Cross-Language Speech Dependent Lip-Synchronization” accepted at ICASSP 2019.
[Feb 2019] Will be spending the next couple of months at IISc Bangalore as a visiting student.
[Jan 2019] Submitted my MS thesis at IIIT Hyderabad.
[Dec 2018] Paper “Spotting Words in Real World Videos : A Retrieval based approach” accepted in the Journal of Machine Vision and Applications (MVA), Springer.
[Jul 2018] Presenting our work on “Lip-Synchronization for Dubbed Instructional Videos” at 2nd Research Symposium, IIIT Hyderabad.
[May 2018] Short paper “Lip-Synchronization for Dubbed Instructional Videos” accepted at CVPR 2018 Workshop (FIVER).
[May 2018] Giving a talk on “Introduction to Image Style Transfer”, at CVIT, IIIT Hyderabad.
[May 2018] Paper “Cross-Modal Style Transfer” accepted at ICIP.
[Apr 2018] Presenting our work on “Word-spotting in Silent Lip videos”, at 1st Research Symposium, IIIT Hyderabad.
[Mar 2018] Presenting our paper “Word-spotting in Silent Lip videos”, at WACV 2018, Lake Tahoe, CA.
[Feb 2018] Organizing annual R&D Showcase 2018, at IIIT Hyderabad.
[Jan 2018] Will be working as a “Mentor” for Foundations of Artificial Intelligence and Machine Learning.




Cross-Language Speech Dependent Lip-Synchronization
Abhishek Jha, Vikram Voleti, Vinay P. Namboodiri, C. V. Jawahar
International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2019

Understanding videos of people speaking across international borders is hard as audiences from different demographies do not understand the language. Such speech videos are often supplemented with language subtitles. However, these hamper the viewing experience as the attention is shared. Simple audio dubbing in a different language makes the video appear unnatural due to unsynchronized lip motion. In this paper, we propose a system for automated cross-language lip synchronization for re-dubbed videos. Our model generates superior photorealistic lip-synchronization over original video in comparison to the current re-dubbing method. With the help of a user-based study, we verify that our method is preferred over unsynchronized videos.


Spotting Words in Silent Speech Videos : A Retrieval based approach
Abhishek Jha, Vinay P. Namboodiri, C. V. Jawahar
Journal of Machine Vision and Applications (MVA), Springer, 2018



Lip-Synchronization for Dubbed Instructional Videos
Abhishek Jha, Vikram Voleti, Vinay P. Namboodiri, C. V. Jawahar
FIVER, CVPR Workshop 2018

[Short Paper] [Poster]


Cross-modal style transfer
Sahil Chelaramani, Abhishek Jha, Anoop Namboodiri
IEEE International Conference on Image Processing (ICIP) 2018

We, humans, have the ability to easily imagine scenes that depict sentences such as “Today is a beautiful sunny day” or “There is a Christmas feel, in the air”. While it is hard to precisely describe what one person may imagine, the essential high-level themes associated with such sentences largely remains the same. The ability to synthesize novel images that depict the feel of a sentence is very useful in a variety of applications such as education, advertisement, and entertainment. While existing papers tackle this problem given a style image, we aim to provide a far more intuitive and easy to use solution that synthesizes novel renditions of an existing image, conditioned on a given sentence. We present a method for cross-modal style transfer between an English sentence and an image, to produce a new image that imbibes the essential theme of the sentence. We do this by modifying the style transfer mechanism used in image style transfer to incorporate a style component derived from the given sentence. We demonstrate promising results using the YFCC100m dataset.


Word Spotting in Silent Lip Videos
Abhishek Jha, Vinay P. Namboodiri, C. V. Jawahar
IEEE Winter Conference on Applications of Computer Vision (WACV) 2018
[Paper] [Poster] [Project Page]

Our goal is to spot words in silent speech videos without explicitly recognizing the spoken words, where the lip motion of the speaker is clearly visible and audio is absent. Existing work in this domain has mainly focused on recognizing a fixed set of words in word-segmented lip videos, which limits the applicability of the learned model due to limited vocabulary and high dependency on the model’s recognition performance. Our contribution is two-fold: 1) we develop a pipeline for recognition-free retrieval, and show its performance against recognition-based retrieval on a large-scale dataset and another set of out-of-vocabulary words. 2) We introduce a query expansion technique using pseudo-relevant feedback and propose a novel re-ranking method based on maximizing the correlation between spatio-temporal landmarks of the query and the top retrieval candidates. Our word spotting method achieves 35% higher mean average precision over recognition-based method on large-scale LRW dataset. Finally, we demonstrate the application of the method by word spotting in a popular speech video (“The great dictator” by Charlie Chaplin) where we show that the word retrieval can be used to understand what was spoken perhaps in the silent movies.


Cross-specificity: modelling data semantics for cross-modal matching and retrieval
Yashaswi Verma, Abhishek Jha, C. V. Jawahar
International Journal of Multimedia Information Retrieval, Springer, June 2018

While dealing with multi-modal data such as pairs of images and text, though individual samples may demonstrate inherent heterogeneity in their content, they are usually coupled with each other based on some higher-level concepts such as their categories. This shared information can be useful in measuring semantics of samples across modalities in a relative manner. In this paper, we investigate the problem of analysing the degree of specificity in the semantic content of a sample in one modality with respect to semantically similar samples in another modality. Samples that have high similarity with semantically similar samples from another modality are considered to be specific, while others are considered to be relatively ambiguous. To model this property, we propose a novel notion of “cross-specificity”. We present two mechanisms to measure cross-specificity: one based on human judgement and other based on an automated approach. We analyse different aspects of cross-specificity and demonstrate its utility in cross-modal retrieval task. Experiments show that though conceptually simple, it can benefit several existing cross-modal retrieval techniques and provide significant boost in their performance.


Spring 2021: Teaching assistant (TA) in the course Information System and Signal Processing (B-KUL-H09M0A), KU Leuven. Course instructor: Prof. Tinne Tuytelaars
Spring 2020: Teaching assistant (TA) in the course Information System and Signal Processing (B-KUL-H09M0A), KU Leuven. Course instructor: Prof. Tinne Tuytelaars
Monsoon 2018: Teaching assistant (TA) in the course Topics in Machine Learning (CSE975), IIIT Hyderabad. Course instructor: Prof. Naresh Manwani
Spring 2018: Mentor in the 1st foundations course on Artificial Intelligence and Machine Learning. Course instructor: Prof. C. V. Jawahar


[Telangana Today] [APN News]

Other Activity

Abhishek Jha (c) 2018