Abhishek Jha
Doctoral Student, VISICS
Departement Elektrotechniek
Katholieke Universiteit Leuven
Email: abhishek [dot] jha [at] esat.kuleuven.be

Bio

I am an ELLIS doctoral student at KU Leuven, Belgium. I am advised by Prof. Tinne Tuytelaars in the PSI division, Departement Elektrotechniek (ESAT), and co-supervised by Dr. Yuki Asano. My PhD research focuses on representation learning, in particular on robust and interpretable representations.

Before joining the doctoral school, I spent a brief period visiting IISc Bangalore as a Project Assistant at the MALL Lab, working with Dr. Partha Pratim Talukdar, Dr. Anirban Chakraborty and Dr. Anand Mishra on weakly supervised video understanding using Knowledge Graphs (KGs).

I completed my Master's (MS) at IIIT Hyderabad, where I was jointly advised by Prof. C. V. Jawahar and Prof. Vinay P. Namboodiri at the Center for Visual Information Technology (CVIT). My master's research focused on Visual Speech Recognition (VSR), which lies at the intersection of multiple modalities: video (speech videos), audio (speech audio) and text (natural language). I have also worked on image stylization for cross-modal transfer of style.

Prior to this, I spent one year (2015-16) as a research fellow at CVIT, working on cross-modal multimedia retrieval under the supervision of Prof. Jawahar. I graduated from IIT Dhanbad, India, in 2014 with a B.Tech in Electronics and Communication Engineering. During my undergraduate years, I worked closely with Prof. Mrinal Sen and Dr. Dilip Prasad on projects related to computer vision and robotics.


Updates

                
[Jul 2023] Attending: ICVSS 2023, Sicily, Italy. I will also be presenting a poster on “Exploring the Stability of Self-Supervised Representations”.
[Oct 2022] Accepted: Our paper “SimGlim: Simplifying glimpse based active visual reconstruction” in WACV 2023.
[Sep 2022] Accepted: Our paper “Barlow constrained optimization for Visual Question Answering” in WACV 2023.
[Aug 2022] I will be attending ELLIS Doctoral symposium 2022 in Alicante, Spain.
[Aug 2021] Accepted: Our paper “Glimpse-Attend-and-Explore: Self-Attention for Active Visual Exploration” in ICCV 2021.
[Jan 2021] Presented a poster on “Transferability of Self-Supervised Representations” at the Mediterranean Machine Learning summer school 2021.
[Sep 2020] Will be attending Mediterranean Machine Learning summer school 2021 in January 2021.
[Aug 2020] Will be attending AI summer school 2020 (online), AI Singapore.
[Nov 2019] Joining KU Leuven, as a PhD student, at PSI ESAT.
[Jul 2019] Accepted: Our paper “Towards Automatic Face-to-Face Translation” in ACM Multimedia 2019.
[May 2019] Vikram presented our work “Cross-Language Speech Dependent Lip-Synchronization” in ICASSP 2019, Brighton, UK.
[Apr 2019] Presented my work on “Audio-Visual Speech Recognition and Synthesis” at MPI-Informatics, Saarbrücken.
[Apr 2019] Successfully defended my MS thesis, “Audio-Visual Speech Recognition and Synthesis”. [Thesis Link]
[Feb 2019] Accepted: “Cross-Language Speech Dependent Lip-Synchronization” in ICASSP 2019.
[Feb 2019] Will be spending the next couple of months at IISc Bangalore as a visiting student.
[Jan 2019] Submitted my MS thesis at IIIT Hyderabad.
[Dec 2018] Paper “Spotting Words in Silent Speech Videos: A Retrieval based Approach” accepted in Machine Vision and Applications (MVA), Springer.
[Jul 2018] Presenting our work on “Lip-Synchronization for Dubbed Instructional Videos” at the 2nd Research Symposium, IIIT Hyderabad.
[May 2018] Short paper “Lip-Synchronization for Dubbed Instructional Videos” accepted at CVPR 2018 Workshop (FIVER).
[May 2018] Giving a talk on “Introduction to Image Style Transfer” at CVIT, IIIT Hyderabad.
[May 2018] Paper “Cross-Modal Style Transfer” accepted at ICIP.
[Apr 2018] Presenting our work on “Word Spotting in Silent Lip Videos” at the 1st Research Symposium, IIIT Hyderabad.
[Mar 2018] Presenting our paper “Word Spotting in Silent Lip Videos” at WACV 2018, Lake Tahoe, CA.
[Feb 2018] Organizing the annual R&D Showcase 2018 at IIIT Hyderabad.
[Jan 2018] Will be working as a mentor for the Foundations of Artificial Intelligence and Machine Learning course.

Publications

 


Barlow constrained optimization for Visual Question Answering
Abhishek Jha, Badri Patro, Luc Van Gool, Tinne Tuytelaars
Winter Conference on Applications of Computer Vision (WACV) 2023
[Arxiv]

Abstract
Visual question answering is a vision-and-language multimodal task that aims at predicting answers given samples from the question and image modalities. Most recent methods focus on learning a good joint embedding space of images and questions, either by improving the interaction between these two modalities, or by making it a more discriminative space. However, how informative this joint space is has not been well explored. In this paper, we propose a novel regularization for VQA models, Constrained Optimization using Barlow's theory (COB), that improves the information content of the joint space by minimizing the redundancy. It reduces the correlation between the learned feature components and thereby disentangles semantic concepts. Our model also aligns the joint space with the answer embedding space, where we consider the answer and image+question as two different ‘views’ of what in essence is the same semantic information. We propose a constrained optimization policy to balance the categorical and redundancy minimization forces. When built on the state-of-the-art GGE model, the resulting model improves VQA accuracy by 1.4% and 4% on the VQA-CP v2 and VQA v2 datasets, respectively. The model also exhibits better interpretability.
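
To make the redundancy-minimization idea concrete, below is a minimal sketch of a Barlow-style decorrelation loss between the joint (image+question) embedding and the answer embedding, treated as two “views”. It only illustrates the general Barlow Twins-style objective; the function name, tensor shapes and the off-diagonal weight are illustrative assumptions, not the paper's COB implementation.

    import torch

    def barlow_redundancy_loss(z_joint: torch.Tensor, z_answer: torch.Tensor,
                               off_diag_weight: float = 5e-3) -> torch.Tensor:
        # z_joint: (batch, dim) image+question embedding; z_answer: (batch, dim) answer embedding
        n = z_joint.shape[0]
        # Standardize each feature over the batch so that c is a cross-correlation matrix.
        z1 = (z_joint - z_joint.mean(0)) / (z_joint.std(0) + 1e-6)
        z2 = (z_answer - z_answer.mean(0)) / (z_answer.std(0) + 1e-6)
        c = (z1.T @ z2) / n                            # (dim, dim) cross-correlation of the two "views"
        diag = torch.diagonal(c)
        on_diag = ((diag - 1.0) ** 2).sum()            # align corresponding feature components
        off_diag = (c ** 2).sum() - (diag ** 2).sum()  # decorrelate different components
        return on_diag + off_diag_weight * off_diag

In a VQA model such a term would be added to the usual answer-classification loss; the paper's constrained optimization policy for balancing the two forces is not reproduced here.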


SimGlim: Simplifying glimpse based active visual reconstruction
Abhishek Jha, Soroush Seifi, Tinne Tuytelaars
Winter Conference on Applications of Computer Vision (WACV) 2023

Abstract
An agent with a limited field of view needs to sample the most informative local observations of an environment in order to model the global context. Current works train this selection strategy by defining a complex architecture built upon features learned through convolutional encoders. In this paper, we first discuss why vision transformers are better suited than CNNs for such an agent. Next, we propose a simple transformer based active visual sampling model, called “SimGlim”, which utilises the transformer's inherent self-attention architecture to sequentially predict the best next location based on the currently observable environment. We show the efficacy of our proposed method on the task of image reconstruction in the partially observable setting and compare our model against existing state-of-the-art active visual reconstruction methods. Finally, we provide ablations for the parameters of our design choice to understand their importance in the overall architecture.
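
As a toy illustration of the “predict the best next location” loop described above, the sketch below scores the patches of a flat grid and greedily picks the highest-scoring patch that has not been observed yet. It is only a schematic of glimpse selection under assumed inputs (a precomputed score vector and an observation mask), not the SimGlim architecture.

    import torch

    def next_glimpse(patch_scores: torch.Tensor, observed: torch.Tensor) -> int:
        # patch_scores: (num_patches,) predicted informativeness of each patch
        # observed: (num_patches,) boolean mask of patches already sampled
        scores = patch_scores.masked_fill(observed, float("-inf"))  # never re-sample a seen patch
        return int(scores.argmax())

In the full model the scores would come from the transformer's self-attention over the glimpses seen so far, and the loop would alternate between sampling a glimpse and updating the reconstruction.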


Glimpse-Attend-and-Explore: Self-Attention for Active Visual Exploration
Soroush Seifi, Abhishek Jha, Tinne Tuytelaars
International Conference on Computer Vision (ICCV) 2021
[Arxiv]

Abstract
Active visual exploration aims to assist an agent with a limited field of view to understand its environment based on partial observations made by choosing the best viewing directions in the scene. Recent methods have tried to address this problem either by using reinforcement learning, which is difficult to train, or by uncertainty maps, which are task-specific and can only be implemented for dense prediction tasks. In this paper, we propose the Glimpse-Attend-and-Explore model which: (a) employs self-attention to guide the visual exploration instead of task-specific uncertainty maps; (b) can be used for both dense and sparse prediction tasks; and (c) uses a contrastive stream to further improve the representations learned. Unlike previous works, we show the application of our model on multiple tasks like reconstruction, segmentation and classification. Our model provides encouraging results while being less dependent on dataset bias in driving the exploration. We further perform an ablation study to investigate the features and attention learned by our model. Finally, we show that our self-attention module learns to attend different regions of the scene by minimizing the loss on the downstream task.


Cross-Language Speech Dependent Lip-Synchronization
Abhishek Jha, Vikram Voleti, Vinay P. Namboodiri, C. V. Jawahar
International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2019
[Link]

Abstract
Understanding videos of people speaking across international borders is hard, as audiences from different demographics do not understand the language. Such speech videos are often supplemented with language subtitles. However, these hamper the viewing experience as the attention is shared. Simple audio dubbing in a different language makes the video appear unnatural due to unsynchronized lip motion. In this paper, we propose a system for automated cross-language lip synchronization for re-dubbed videos. Our model generates superior photorealistic lip-synchronization over the original video in comparison to the current re-dubbing method. With the help of a user-based study, we verify that our method is preferred over unsynchronized videos.


Spotting Words in Silent Speech Videos: A Retrieval based Approach
Abhishek Jha, Vinay P. Namboodiri, C. V. Jawahar
Machine Vision and Applications (MVA), Springer, 2018
[Paper]


Lip-Synchronization for Dubbed Instructional Videos
Abhishek Jha, Vikram Voleti, Vinay P. Namboodiri, C. V. Jawahar
FIVER Workshop, CVPR 2018
[Short Paper] [Poster]


Cross-Modal Style Transfer
Sahil Chelaramani, Abhishek Jha, Anoop Namboodiri
IEEE International Conference on Image Processing (ICIP) 2018
[Paper]

Abstract
We, humans, have the ability to easily imagine scenes that depict sentences such as “Today is a beautiful sunny day” or “There is a Christmas feel in the air”. While it is hard to precisely describe what one person may imagine, the essential high-level themes associated with such sentences largely remain the same. The ability to synthesize novel images that depict the feel of a sentence is very useful in a variety of applications such as education, advertisement, and entertainment. While existing papers tackle this problem given a style image, we aim to provide a far more intuitive and easy-to-use solution that synthesizes novel renditions of an existing image, conditioned on a given sentence. We present a method for cross-modal style transfer between an English sentence and an image, to produce a new image that imbibes the essential theme of the sentence. We do this by modifying the style transfer mechanism used in image style transfer to incorporate a style component derived from the given sentence. We demonstrate promising results using the YFCC100m dataset.
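
For context on the mechanism mentioned above, the sketch below shows the Gram-matrix style loss from Gatys-style image style transfer; in a cross-modal setting the target Gram matrices would be derived from the sentence's style component rather than from a style image. The function names and shapes are illustrative assumptions, not the paper's implementation.

    import torch

    def gram_matrix(feat: torch.Tensor) -> torch.Tensor:
        # feat: (channels, height, width) feature map from one CNN layer
        c, h, w = feat.shape
        f = feat.reshape(c, h * w)
        return (f @ f.T) / (c * h * w)        # normalized channel-wise correlations

    def style_loss(generated_feats, target_grams):
        # Sum of squared Gram-matrix differences over a list of layers.
        loss = torch.zeros(())
        for feat, g_target in zip(generated_feats, target_grams):
            loss = loss + ((gram_matrix(feat) - g_target) ** 2).sum()
        return loss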


Word Spotting in Silent Lip Videos
Abhishek Jha, Vinay P. Namboodiri, C. V. Jawahar
IEEE Winter Conference on Applications of Computer Vision (WACV) 2018
[Paper] [Poster] [Project Page]

Abstract
Our goal is to spot words in silent speech videos without explicitly recognizing the spoken words, where the lip motion of the speaker is clearly visible and audio is absent. Existing work in this domain has mainly focused on recognizing a fixed set of words in word-segmented lip videos, which limits the applicability of the learned model due to limited vocabulary and high dependency on the model's recognition performance. Our contribution is two-fold: 1) we develop a pipeline for recognition-free retrieval, and show its performance against recognition-based retrieval on a large-scale dataset and another set of out-of-vocabulary words. 2) We introduce a query expansion technique using pseudo-relevant feedback and propose a novel re-ranking method based on maximizing the correlation between spatio-temporal landmarks of the query and the top retrieval candidates. Our word spotting method achieves 35% higher mean average precision over the recognition-based method on the large-scale LRW dataset. Finally, we demonstrate the application of the method by word spotting in a popular speech video (“The Great Dictator” by Charlie Chaplin), where we show that word retrieval can be used to understand what was spoken, perhaps even in silent movies.
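
The recognition-free retrieval and query expansion described above can be pictured with a generic embedding-based recipe: rank videos by cosine similarity to the query embedding, then pull the query towards its top retrievals (a standard Rocchio-style pseudo-relevance feedback step). The sketch below is only an illustration of that general idea under assumed embeddings; it is not the paper's pipeline or its landmark-correlation re-ranking.

    import torch
    import torch.nn.functional as F

    def rank(query_emb: torch.Tensor, video_embs: torch.Tensor) -> torch.Tensor:
        # query_emb: (dim,); video_embs: (num_videos, dim)
        sims = F.cosine_similarity(query_emb.unsqueeze(0), video_embs, dim=1)
        return sims.argsort(descending=True)      # video indices, best match first

    def expand_query(query_emb, video_embs, top_k=5, alpha=1.0, beta=0.5):
        # Rocchio-style expansion: move the query towards the mean of its top-k retrievals.
        order = rank(query_emb, video_embs)
        pseudo_relevant = video_embs[order[:top_k]].mean(dim=0)
        return F.normalize(alpha * query_emb + beta * pseudo_relevant, dim=0)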


Cross-specificity: modelling data semantics for cross-modal matching and retrieval
Yashaswi Verma, Abhishek Jha, C. V. Jawahar
International Journal of Multimedia Information Retrieval, Springer, June 2018
[Link]

Abstract
While dealing with multi-modal data such as pairs of images and text, though individual samples may demonstrate inherent heterogeneity in their content, they are usually coupled with each other based on some higher-level concepts such as their categories. This shared information can be useful in measuring semantics of samples across modalities in a relative manner. In this paper, we investigate the problem of analysing the degree of specificity in the semantic content of a sample in one modality with respect to semantically similar samples in another modality. Samples that have high similarity with semantically similar samples from another modality are considered to be specific, while others are considered to be relatively ambiguous. To model this property, we propose a novel notion of “cross-specificity”. We present two mechanisms to measure cross-specificity: one based on human judgement and the other based on an automated approach. We analyse different aspects of cross-specificity and demonstrate its utility in the cross-modal retrieval task. Experiments show that, though conceptually simple, it can benefit several existing cross-modal retrieval techniques and provide a significant boost in their performance.
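
Following the definition above, one plausible automated reading of cross-specificity is the mean similarity between a sample's embedding and the embeddings of semantically similar samples (e.g. same-category items) from the other modality. The snippet below is only this hedged reading for illustration; the paper's actual human-judgement-based and automated measures are not reproduced here.

    import torch
    import torch.nn.functional as F

    def cross_specificity(sample_emb: torch.Tensor, similar_other_modality: torch.Tensor) -> torch.Tensor:
        # sample_emb: (dim,) embedding of a sample in one modality (e.g. an image)
        # similar_other_modality: (k, dim) embeddings of semantically similar samples
        #                         from the other modality (e.g. text of the same category)
        sims = F.cosine_similarity(sample_emb.unsqueeze(0), similar_other_modality, dim=1)
        return sims.mean()                        # high -> specific, low -> relatively ambiguous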


Teaching

                 
Spring 2022: Teaching assistant (TA) in the course Information System and Signal Processing (B-KUL-H09M0A), KU Leuven. Course instructor: Prof. Tinne Tuytelaars
Spring 2021: Teaching assistant (TA) in the course Information System and Signal Processing (B-KUL-H09M0A), KU Leuven. Course instructor: Prof. Tinne Tuytelaars
Spring 2020: Teaching assistant (TA) in the course Information System and Signal Processing (B-KUL-H09M0A), KU Leuven. Course instructor: Prof. Tinne Tuytelaars
Monsoon 2018: Teaching assistant (TA) in the course Topics in Machine Learning (CSE975), IIIT Hyderabad. Course instructor: Prof. Naresh Manwani
Spring 2018: Mentor in the 1st Foundations course on Artificial Intelligence and Machine Learning. Course instructor: Prof. C. V. Jawahar

Services

[Telangana Today] [APN News]

Other Activity


Abhishek Jha (c) 2018