Ayan Kumar Bhunia
I completed my Doctor of Philosophy (PhD), focusing on Computer Vision and Deep Learning, at the SketchX Lab, Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, England, United Kingdom, under the supervision of Prof. Yi-Zhe Song and Prof. Tao (Tony) Xiang.
Prior to that, I worked as a full-time research assistant at the Institute for Media Innovation (IMI) Lab of Nanyang Technological University (NTU), Singapore.
Currently, I am working as a Senior Research Scientist (Computer Vision) at iSIZE, a London-based deep-tech company specializing in deep learning for video delivery.
Top-venue conference publications (as of June 2022): 10x CVPR, 3x ICCV, 3x ECCV, 1x SIGGRAPH Asia.
Google Scholar / GitHub / LinkedIn / DBLP
Recent Updates
New! [Oct 2022]: Defended my PhD Thesis before Prof. Stella Yu and Prof. Adrian Hilton, with no corrections.
New! [July 2022]: Two papers accepted at ECCV 2022.
New! [March 2022]: Four papers accepted at CVPR 2022.
[July 2021]: Three papers accepted at ICCV 2021.
[June 2021]: Gave a talk on 'Beyond Supervised Sketch Representation Learning' (YouTube).
[March 2021]: Four papers accepted at CVPR 2021.
[Aug 2020]: One paper accepted at SIGGRAPH Asia 2020. Check the Online Demo.
[Aug 2020]: One paper accepted at BMVC 2020 for oral presentation.
[July 2020]: One paper accepted at ECCV 2020.
[March 2020]: One paper accepted at CVPR 2020 for oral presentation.
Notes:
If you are interested in potential research collaboration, feel free to contact me by email or LinkedIn. I am especially happy to collaborate with self-motivated and enthusiastic undergraduate or postgraduate students who intend to pursue an MS/PhD in the future.
Research Interests
My research broadly centres on Computer Vision and Deep Learning. Within computer vision, I have explored three specific topics.
a) Sketch for Visual Understanding:
Hand-drawn sketches by nature inherit the cognitive potential of human intelligence, which makes them a natural medium for various visual understanding tasks. During my PhD, my research centred on how sketches can be leveraged to address different visual understanding problems. For instance, I extended traditional sketch-based image retrieval (SBIR) to an on-the-fly retrieval setup, where the system starts retrieving as soon as the user starts drawing. In addition, I explored Annotation-Efficient Learning under low-resource data scenarios, including a semi-supervised framework for cross-modal instance-level retrieval and self-supervised learning on sparse image data such as sketch and handwriting. With the recent proliferation of touch-screen devices, sketch has become a promising medium for interacting with digital systems, thanks to the fine-grained, personalised control it offers. Looking ahead, I am keen to explore how 2D sketches can facilitate creative image generation/editing with 3D perception.
b) Document Image Analysis and Text Recognition:
Over the last few years, I have worked on various problems in Document Analysis and Recognition. A few representative works include MetaHTR (a writer-adaptive Handwritten Text Recognition system), Unifying Handwriting and Scene Text Recognition, Unsupervised Document Image Binarization, and Script Identification, among others.
c) Image/Video Restoration:
At my current company, iSIZE, I work on developing state-of-the-art Image/Video Restoration solutions. During my ongoing tenure at iSIZE, I designed and developed, from the ground up, the deep model behind the product BitClear, which provides a low-cost neural solution for removing compressed-video artefacts. BitClear won the NAB Product of the Year award (2022) in the AI/ML category; the NAB Show is the largest trade show for media, entertainment and technology.
Moreover, from an industry perspective, these are the subtopics I have explored so far:
(i) Cross-modal Image Retrieval, (ii) Generative Adversarial Networks, (iii) Meta-Learning, (iv) Self-supervised Learning, (v) Image/Video Denoising, (vi) Text Recognition, (vii) Image-to-Sequence Generation, (viii) RL for Vision, (ix) Semi-supervised Learning, (x) Fine-grained Visual Recognition, (xi) Knowledge Distillation, (xii) Incremental Learning.
FS-COCO: Towards Understanding of Freehand Sketches of Common Objects in Context
Pinaki Nath Chowdhury, Aneeshan Sain,
Yulia Gryaditskaya,
Ayan Kumar Bhunia
,
Tao
Xiang
, Yi-Zhe Song
.
European Conference on Computer Vision (
ECCV
),
2022 (New!)
Abstract
/
Code
/
arXiv
/
BibTex
We advance sketch research to scenes with the first dataset of freehand scene sketches,
FS-COCO. With practical applications in mind, we collect sketches that convey scene content well but can be sketched within a few minutes by a person with any level of sketching skill. Our dataset comprises 10,000 freehand scene vector sketches with per-point space-time information, drawn by 100 non-expert individuals, offering both object- and scene-level abstraction. Each sketch
is augmented with its text description. Using our dataset, we study for the first time the
problem of fine-grained image retrieval from freehand scene sketches and sketch captions. We
draw insights on: (i) Scene salience encoded in sketches via the strokes' temporal order;
(ii) Performance comparison of image retrieval from a scene sketch and an image caption; (iii)
Complementarity of information in sketches and image captions, as well as the potential
benefit of combining the two modalities. In addition, we extend a popular vector sketch
LSTM-based encoder to handle sketches with larger complexity than was supported by previous
work. Namely, we propose a hierarchical sketch decoder, which we leverage in a sketch-specific “pre-text” task. Our dataset enables, for the first time, research on freehand scene sketch understanding and its practical applications.
@InProceedings{FSCOCO,
author = {Pinaki Nath Chowdhury and Aneeshan Sain and Yulia Gryaditskaya and Ayan Kumar Bhunia
and Tao Xiang and Yi-Zhe Song},
title = {FS-COCO: Towards Understanding of Freehand Sketches
of Common Objects in Context},
booktitle = {ECCV},
month = {October},
year = {2022}
}
Adaptive Fine-Grained Sketch-Based Image Retrieval
Ayan Kumar Bhunia
, Aneeshan Sain, Parth Hiren Shah,
Animesh Gupta, Pinaki Nath Chowdhury, Tao
Xiang
, Yi-Zhe Song
.
European Conference on Computer Vision (
ECCV
),
2022 (New!)
Abstract
/
Code
/
arXiv
/
BibTex
The recent focus on Fine-Grained Sketch-Based Image Retrieval (FG-SBIR) has shifted towards
generalising a model to new categories without any training data from them. In real-world
applications, however, a trained FG-SBIR model is often applied to both new categories and
different human sketchers, i.e., different drawing styles. Although this complicates the
generalisation problem, fortunately, a handful of examples are typically available, enabling
the model to adapt to the new category/style. In this paper, we offer a novel perspective --
instead of asking for a model that generalises, we advocate for one that quickly adapts, with
just very few samples during testing (in a few-shot manner). To solve this new problem, we
introduce a novel model-agnostic meta-learning (MAML) based framework with several key
modifications: (1) As a retrieval task with a margin-based contrastive loss, we simplify the
MAML training in the inner loop to make it more stable and tractable. (2) The margin in our
contrastive loss is also meta-learned with the rest of the model. (3) Three additional
regularisation losses are introduced in the outer loop, to make the meta-learned FG-SBIR model
more effective for category/style adaptation. Extensive experiments on public datasets suggest
a large gain over generalisation and zero-shot based approaches, and a few strong few-shot
baselines.
@InProceedings{adaptivefgsbir,
author = {Ayan Kumar Bhunia and Aneeshan Sain and Parth Hiren Shah and Animesh Gupta and
Pinaki Nath Chowdhury and Tao Xiang and Yi-Zhe Song},
title = {Adaptive Fine-Grained Sketch-Based Image Retrieval},
booktitle = {ECCV},
month = {October},
year = {2022}
}
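The adapt-rather-than-generalise idea above can be pictured as a few inner-loop gradient steps on the handful of support sketch-photo pairs, using a triplet loss whose margin was itself meta-learned. Below is a minimal PyTorch-style sketch under that reading; the function and variable names are hypothetical and this is not the authors' released code.

import copy
import torch
import torch.nn.functional as F

def adapt_fgsbir(encoder, margin, support_sketches, support_photos, inner_lr=0.01, steps=1):
    """Test-time adaptation of a meta-trained FG-SBIR encoder (illustrative only).
    support_sketches, support_photos: a few paired examples from the new category/style."""
    fast = copy.deepcopy(encoder)                      # keep the meta-trained weights intact
    opt = torch.optim.SGD(fast.parameters(), lr=inner_lr)
    for _ in range(steps):
        s = F.normalize(fast(support_sketches), dim=-1)    # (K, D) sketch embeddings
        p = F.normalize(fast(support_photos), dim=-1)      # (K, D) photo embeddings
        dist = torch.cdist(s, p)                            # pairwise sketch-photo distances
        pos = dist.diag()                                   # matched pairs
        neg = (dist + 1e6 * torch.eye(len(s), device=dist.device)).min(dim=1).values
        # margin-based triplet loss; the margin value comes from outer-loop meta-learning
        loss = F.relu(pos - neg + margin).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return fast                                             # category/style-adapted retrieval model

# Usage (hypothetical): adapted = adapt_fgsbir(encoder, float(meta_margin), sk, ph); then embed
# the gallery and query sketches with `adapted` and rank photos by cosine similarity.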
Doodle It Yourself: Class Incremental Learning by Drawing a Few Sketches
Ayan Kumar Bhunia
, Viswanatha Reddy Gajjala, Subhadeep Koley,
Rohit Kundu, Aneeshan Sain, Tao Xiang
, Yi-Zhe Song
.
IEEE Conference on Computer Vision and Pattern Recognition (
CVPR
),
2022 (New!)
Abstract
/
Code
/
arXiv
/
Marktechpost Blog
/
BibTex
The human visual system is remarkable in learning new visual concepts from just a few
examples. This is precisely the goal behind few-shot class incremental learning (FSCIL), where
the emphasis is additionally placed on ensuring the model does not suffer from ``forgetting''.
In this paper, we push the boundary further for FSCIL by addressing two key questions that
bottleneck its ubiquitous application (i) can the model learn from diverse modalities other
than just photo (as humans do), and (ii) what if photos are not readily accessible (due to
ethical and privacy constraints). Our key innovation lies in advocating the use of sketches as
a new modality for class support. The product is a ``Doodle It Yourself" (DIY) FSCIL framework
where the users can freely sketch a few examples of a novel class for the model to learn to
recognize photos of that class. For that, we present a framework that infuses (i) gradient
consensus for domain invariant learning, (ii) knowledge distillation for preserving old class
information, and (iii) graph attention networks for message passing between old and novel
classes. We experimentally show that sketches are better class support than text in the
context of FSCIL, echoing findings elsewhere in the sketching literature.
@InProceedings{DoodleIncremental,
author = {Ayan Kumar Bhunia and Viswanatha Reddy Gajjala and Subhadeep Koley and Rohit Kundu and
Aneeshan Sain and Tao Xiang and Yi-Zhe Song},
title = {Doodle It Yourself: Class Incremental Learning by Drawing a Few Sketches},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2022}
}
Sketching without Worrying: Noise-Tolerant Sketch-Based Image Retrieval
Ayan Kumar Bhunia
, Subhadeep Koley, Abdullah Faiz Ur Rahman Khilji
, Aneeshan Sain, Pinaki Nath Chowdhury, Tao Xiang, Yi-Zhe Song
.
IEEE Conference on Computer Vision and Pattern Recognition (
CVPR
),
2022 (New!)
Abstract
/
Code
/
arXiv
/
BibTex
Sketching enables many exciting applications, notably, image retrieval. The
fear-to-sketch problem (i.e., ``I can't sketch") has however proven to be fatal for its
widespread adoption. This paper tackles this ``fear" head on, and for the first time,
proposes an auxiliary module for existing retrieval models that predominantly lets the users
sketch without having to worry. We first conducted a pilot study that revealed the secret
lies in the existence of noisy strokes, but not so much of the ``I can't sketch". We
consequently design a stroke subset selector that detects noisy strokes, leaving only those which make a positive contribution towards successful retrieval. Our Reinforcement
Learning based formulation quantifies the importance of each stroke present in a given
subset, based on the extent to which that stroke contributes to retrieval. When combined
with pre-trained retrieval models as a pre-processing module, we achieve a significant gain
of 8\%-10\% over standard baselines and in turn report new state-of-the-art performance.
Last but not least, we demonstrate the selector once trained, can also be used in a
plug-and-play manner to empower various sketch applications in ways that were not previously
possible.
@InProceedings{strokesubset,
author = {Ayan Kumar Bhunia and Subhadeep Koley and Abdullah Faiz Ur Rahman Khilji and
Aneeshan Sain
and Pinaki Nath Chowdhury and Tao Xiang and Yi-Zhe Song},
title = {Sketching without Worrying: Noise-Tolerant Sketch-Based Image Retrieval},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2022}
}
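One way to picture the reinforcement-learning formulation above: a small selector scores each stroke, a stroke subset is sampled, and the retrieval rank of the paired photo acts as the reward for a REINFORCE update. A rough sketch, assuming per-stroke features and a frozen pre-trained retrieval model behind rank_fn; all names are hypothetical.

import torch
import torch.nn as nn

class StrokeSelector(nn.Module):
    """Predicts a keep-probability for every stroke of a sketch."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, stroke_feats):                    # (num_strokes, feat_dim)
        return torch.sigmoid(self.net(stroke_feats)).squeeze(-1)

def reinforce_step(selector, optimizer, stroke_feats, rank_fn):
    """rank_fn(keep_mask) -> rank of the true photo when retrieving with only the kept strokes."""
    probs = selector(stroke_feats)
    dist = torch.distributions.Bernoulli(probs)
    keep = dist.sample()                                # sampled stroke subset
    reward = -float(rank_fn(keep))                      # lower rank (earlier hit) => higher reward
    loss = -(dist.log_prob(keep).sum() * reward)        # score-function (REINFORCE) gradient
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return keep, reward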
Partially Does It: Towards Scene-Level FG-SBIR with Partial Input
Pinaki Nath Chowdhury,
Ayan Kumar Bhunia
, Viswanatha Reddy Gajjala, Aneeshan Sain,
Tao Xiang, Yi-Zhe Song
.
IEEE Conference on Computer Vision and Pattern Recognition (
CVPR
),
2022 (New!)
Abstract
/
Code
/
arXiv
/
BibTex
We scrutinise an important observation plaguing scene-level sketch research -- that a
significant portion of scene sketches are ``partial". A quick pilot study reveals: (i) a scene
sketch does not necessarily contain all objects in the corresponding photo, due to the
subjective holistic interpretation of scenes, (ii) there exist significant empty (white)
regions as a result of object-level abstraction, and as a result, (iii) existing scene-level
fine-grained sketch-based image retrieval methods collapse as scene sketches become more
partial. To solve this ``partial" problem, we advocate for a simple set-based approach using
optimal transport (OT) to model cross-modal region associativity in a partially-aware fashion.
Importantly, we improve upon OT to further account for holistic partialness by comparing
intra-modal adjacency matrices. Our proposed method is not only robust to partial
scene-sketches but also yields state-of-the-art performance on existing datasets.
@InProceedings{PartialSBIR,
author = {Pinaki Nath Chowdhury and Ayan Kumar Bhunia and Viswanatha Reddy Gajjala and Aneeshan
Sain and Tao Xiang and Yi-Zhe Song},
title = {Partially Does It: Towards Scene-Level FG-SBIR with Partial Input},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2022}
}
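The partial-matching step above can be read as an entropic optimal-transport problem between sketch-region and photo-region descriptors. Below is a minimal Sinkhorn sketch of that reading only; the paper's actual formulation additionally compares intra-modal adjacency matrices to handle holistic partialness.

import torch

def sinkhorn_plan(sketch_regions, photo_regions, eps=0.1, iters=50):
    """sketch_regions: (m, d), photo_regions: (n, d). Returns the transport plan and OT cost."""
    cost = torch.cdist(sketch_regions, photo_regions)        # region-to-region matching cost
    K = torch.exp(-cost / eps)                               # Gibbs kernel
    a = torch.full((cost.size(0),), 1.0 / cost.size(0), device=cost.device)  # uniform masses
    b = torch.full((cost.size(1),), 1.0 / cost.size(1), device=cost.device)
    u, v = torch.ones_like(a), torch.ones_like(b)
    for _ in range(iters):                                   # Sinkhorn scaling iterations
        u = a / (K @ v)
        v = b / (K.t() @ u)
    plan = torch.diag(u) @ K @ torch.diag(v)                 # soft region correspondence
    return plan, (plan * cost).sum()                         # the cost can serve as a retrieval distance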
Sketch3T: Test-time Training for Zero-Shot SBIR
Aneeshan Sain,
Ayan Kumar Bhunia
, Vaishnav Potlapalli , Pinaki Nath Chowdhury,
Tao Xiang, Yi-Zhe Song
.
IEEE Conference on Computer Vision and Pattern Recognition (
CVPR
),
2022 (New!)
Abstract
/
Code
/
arXiv
/
BibTex
Zero-shot sketch-based image retrieval typically asks for a trained model to be applied as
is to unseen categories. In this paper, we question this setup, arguing that it is by definition not compatible with the inherent abstract and subjective nature of sketches -- the model might transfer well to new categories, but will not understand sketches drawn from a different test-time distribution. We thus extend ZS-SBIR, asking it to transfer to both
categories and sketch distributions. Our key contribution is a test-time training paradigm
that can adapt using just one sketch. Since there is no paired photo, we make use of a sketch
raster-vector reconstruction module as a self-supervised auxiliary task. To maintain the
fidelity of the trained cross-modal joint embedding during test-time update, we design a novel
meta-learning based training paradigm to learn a separation between model updates incurred by
this auxiliary task from those of the primary objective of discriminative learning. Extensive experiments show our model to outperform state-of-the-arts, thanks to the proposed test-time adaptation that not only transfers to new categories but also accommodates new sketching styles.
@InProceedings{Sketch3T,
author = {Aneeshan Sain and Ayan Kumar Bhunia and Vaishnav Potlapalli and Pinaki Nath Chowdhury
and Tao Xiang and Yi-Zhe Song},
title = {Sketch3T: Test-time Training for Zero-Shot SBIR},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2022}
}
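A simplified picture of the test-time training loop described above: before retrieving, the sketch encoder is updated with a self-supervised reconstruction loss computed on the single query sketch, so no paired photo is needed. The module names and the MSE stand-in loss below are assumptions, and the paper's meta-learned separation of auxiliary and primary updates is omitted.

import copy
import torch
import torch.nn.functional as F

def test_time_adapt(encoder, aux_decoder, sketch_raster, sketch_vector, lr=1e-4, steps=5):
    """Adapt the sketch branch to one test sketch via a raster-to-vector auxiliary task."""
    model = copy.deepcopy(encoder)                   # leave the deployed model untouched
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        feat = model(sketch_raster)                  # embed the rasterised query sketch
        pred_vector = aux_decoder(feat)              # self-supervised reconstruction head
        loss = F.mse_loss(pred_vector, sketch_vector)
        opt.zero_grad(); loss.backward(); opt.step()
    return model                                     # style-adapted encoder used for ZS-SBIR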
Text is Text, No Matter What: Unifying Text Recognition using Knowledge
Distillation
Ayan Kumar Bhunia
, Aneeshan Sain, Pinaki Nath Chowdhury,
Yi-Zhe Song
.
IEEE International Conference on Computer Vision (
ICCV
),
2021
Abstract
/
arXiv
/
BibTex
Text recognition remains a fundamental and extensively researched topic in computer vision,
largely owing to its wide array of commercial applications. The challenging nature of the very
problem however dictated a fragmentation of research efforts: Scene Text Recognition (STR)
that
deals with text in everyday scenes, and Handwriting Text Recognition (HTR) that tackles
hand-written text. In this paper, for the first time, we argue for their unification -- we aim
for a single model that can compete favourably with two separate state-of-the-art STR and HTR
models. We first show that cross-utilisation of STR and HTR models trigger significant
performance drops due to differences in their inherent challenges. We then tackle their union
by
introducing a knowledge distillation (KD) based framework. This is however non-trivial,
largely
due to the variable-length and sequential nature of text sequences, which renders
off-the-shelf
KD techniques that mostly work with global fixed-length data inadequate. For that, we propose
three distillation losses all of which specifically designed to cope with the aforementioned
unique characteristics of text recognition. Empirical evidence suggests that our proposed
unified model performs on par with individual models, even surpassing them in certain cases.
Ablative studies demonstrate that naive baselines such as a two-stage framework, and domain
adaption/generalisation alternatives do not work as well, further verifying the
appropriateness
of our design.
@InProceedings{textistext,
author = {Ayan Kumar Bhunia and Aneeshan Sain and Pinaki Nath Chowdhury and Yi-Zhe Song},
title = {Text is Text, No Matter What: Unifying Text Recognition using Knowledge
Distillation},
booktitle = {The IEEE International Conference on Computer Vision (ICCV)},
month = {October},
year = {2021}
}
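The distillation idea above can be illustrated with a generic sequence-level KD term: at every decoding step, the unified student matches the character distribution of the specialist STR or HTR teacher. This is a standard temperature-scaled KL sketch, not the paper's three specific losses.

import torch.nn.functional as F

def sequence_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """student_logits, teacher_logits: (batch, seq_len, num_chars) from the decoders."""
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    # KL between per-character distributions at each time step; t*t keeps the gradient scale
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)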
Towards the Unseen: Iterative Text Recognition by Distilling from Errors
Ayan Kumar Bhunia
, Pinaki Nath Chowdhury, Aneeshan Sain,
Yi-Zhe Song
.
IEEE International Conference on Computer Vision (
ICCV
),
2021
Abstract
/
arXiv
/
BibTex
Visual text recognition is undoubtedly one of the most extensively researched topics in
computer vision. Great progress has been made to date, with the latest models starting to
focus
on the more practical ``in-the-wild'' setting. However, a salient problem still hinders
practical deployment -- prior arts mostly struggle with recognising unseen (or rarely seen)
character sequences. In this paper, we put forward a novel framework to specifically tackle
this
``unseen'' problem. Our framework is iterative in nature, in that it utilises predicted
knowledge of character sequences from a previous iteration, to augment the main network in
improving the next prediction. Key to our success is a unique cross-modal variational
autoencoder to act as a feedback module, which is trained with the presence of textual error
distribution data. This module importantly translates a discrete predicted character space to a
continuous affine transformation parameter space used to condition the visual feature map at
next iteration. Experiments on common datasets have shown competitive performance over
state-of-the-arts under the conventional setting. Most importantly, under the new disjoint
setup
where train-test labels are mutually exclusive, ours offers the best performance thus
showcasing
the capability of generalising onto unseen words.
@InProceedings{towardsunseen,
author = {Ayan Kumar Bhunia and Pinaki Nath Chowdhury and Aneeshan Sain and Yi-Zhe Song},
title = {Towards the Unseen: Iterative Text Recognition by Distilling from Errors},
booktitle = {The IEEE International Conference on Computer Vision (ICCV)},
month = {October},
year = {2021}
}
Joint Visual Semantic Reasoning: Multi-Stage Decoder for Text Recognition
Ayan Kumar Bhunia
, Aneeshan Sain,
Amandeep Kumar, Shuvozit Ghose, Pinaki
Nath
Chowdhury,
Yi-Zhe Song
.
IEEE International Conference on Computer Vision (
ICCV
),
2021
Abstract
/
arXiv
/
BibTex
Although text recognition has significantly evolved over the years, state-of-the-art (SOTA)
models still struggle in the wild scenarios due to complex backgrounds, varying fonts,
uncontrolled illuminations, distortions and other artifacts. This is because such models
solely
depend on visual information for text recognition, thus lacking semantic reasoning
capabilities.
In this paper, we argue that semantic information offers a complementary role in addition to the visual one. More specifically, we additionally utilize semantic information by proposing a
multi-stage multi-scale attentional decoder that performs joint visual-semantic reasoning.
Our
novelty lies in the intuition that for text recognition, prediction should be refined in a
stage-wise manner. Therefore our key contribution is in designing a stage-wise unrolling
attentional decoder where non-differentiability, invoked by discretely predicted character
labels, needs to be bypassed for end-to-end training. While the first stage predicts using
visual features, subsequent stages refine on-top of it using joint visual-semantic
information.
Additionally, we introduce multi-scale 2D attention along with dense and residual
connections
between different stages to deal with varying scales of character sizes, for better
performance
and faster convergence during training. Experimental results show our approach to outperform
existing SOTA methods by a considerable margin.
@InProceedings{JVSR,
author = {Ayan Kumar Bhunia and Aneeshan Sain and Amandeep Kumar and Shuvozit Ghose and Pinaki
Nath Chowdhury and Yi-Zhe Song},
title = {Joint Visual Semantic Reasoning: Multi-Stage Decoder for Text Recognition},
booktitle = {The IEEE International Conference on Computer Vision (ICCV)},
month = {October},
year = {2021}
}
Vectorization and Rasterization: Self-Supervised Learning for Sketch and
Handwriting
Ayan Kumar Bhunia
, Pinaki Nath Chowdhury, Yongxin Yang,
Timothy Hospedales, Tao Xiang,
Yi-Zhe Song
.
IEEE Conference on Computer Vision and Pattern Recognition (
CVPR
),
2021
Abstract
/
Code
/
arXiv
/
BibTex
Self-supervised learning has gained prominence due to its efficacy at learning powerful
representations from unlabelled data that achieve excellent performance on many challenging
downstream tasks. However, supervision-free pre-text tasks are challenging to design and usually
modality specific. Although there is a rich literature of self-supervised methods for either
spatial (such as images) or temporal data (sound or text) modalities, a common pre-text task
that benefits both modalities is largely missing. In this paper, we are interested in defining
a
self-supervised pre-text task for sketches and handwriting data. This data is uniquely
characterised by its existence in dual modalities of rasterized images and vector coordinate
sequences. We address and exploit this dual representation by proposing two novel cross-modal
translation pre-text tasks for self-supervised feature learning: Vectorization and
Rasterization. Vectorization learns to map image space to vector coordinates and rasterization
maps vector coordinates to image space. We show that our learned encoder modules benefit
both raster-based and vector-based downstream approaches to analysing hand-drawn data.
Empirical
evidence shows that our novel pre-text tasks surpass existing single and multi-modal
self-supervision methods.
@InProceedings{sketch2vec,
author = {Ayan Kumar Bhunia and Pinaki Nath Chowdhury and Yongxin Yang and Timothy Hospedales
and
Tao Xiang and Yi-Zhe Song},
title = {Vectorization and Rasterization: Self-Supervised Learning for Sketch and
Handwriting},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2021}
}
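Schematically, the two pre-text tasks translate between the raster and vector forms of the same drawing. A sketch of one training step is given below, with hypothetical encoder/decoder modules and deliberately simplified losses (the actual vector decoder is sequential):

import torch.nn.functional as F

def pretext_step(img_enc, vec_dec, seq_enc, img_dec, raster, points, optimizer):
    """raster: (B, 1, H, W) sketch images in [0, 1]; points: (B, T, 2) the same sketches as sequences."""
    pred_points = vec_dec(img_enc(raster))                 # vectorization: image -> coordinates
    loss_vec = F.mse_loss(pred_points, points)
    pred_raster = img_dec(seq_enc(points))                 # rasterization: coordinates -> image
    loss_ras = F.binary_cross_entropy_with_logits(pred_raster, raster)
    loss = loss_vec + loss_ras
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

# After pre-training, img_enc / seq_enc serve as feature extractors for raster- and
# vector-based downstream sketch and handwriting tasks.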
More Photos are All You Need: Semi-Supervised Learning for Fine-Grained Sketch
Based Image Retrieval
Ayan Kumar Bhunia
, Pinaki Nath Chowdhury, Aneeshan Sain,
Yongxin Yang, Tao Xiang,
Yi-Zhe Song
.
IEEE Conference on Computer Vision and Pattern Recognition (
CVPR
),
2021
Abstract
/
Code
/
arXiv
/
BibTex
A fundamental challenge faced by existing Fine-Grained Sketch-Based Image Retrieval
(FG-SBIR)
models is the data scarcity -- model performances are largely bottlenecked by the lack of
sketch-photo pairs. Whilst the number of photos can be easily scaled, each corresponding
sketch
still needs to be individually produced. In this paper, we aim to mitigate such an upper-bound
on sketch data, and study whether unlabelled photos alone (of which there are many) can be cultivated for performance gains. In particular, we introduce a novel semi-supervised
framework
for cross-modal retrieval that can additionally leverage large-scale unlabelled photos to
account for data scarcity. At the centre of our semi-supervision design is a sequential
photo-to-sketch generation model that aims to generate paired sketches for unlabelled photos.
Importantly, we further introduce a discriminator guided mechanism to guide against unfaithful
generation, together with a distillation loss based regularizer to provide tolerance against
noisy training samples. Last but not least, we treat generation and retrieval as two conjugate
problems, where a joint learning procedure is devised for each module to mutually benefit from
each other. Extensive experiments show that our semi-supervised model yields significant
performance boost over the state-of-the-art supervised alternatives, as well as existing
methods
that can exploit unlabelled photos for FG-SBIR.
@InProceedings{semi-fgsbir,
author = {Ayan Kumar Bhunia and Pinaki Nath Chowdhury and Aneeshan Sain and Yongxin Yang and Tao
Xiang and Yi-Zhe Song},
title = {More Photos are All You Need: Semi-Supervised Learning for Fine-Grained Sketch Based
Image Retrieval},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2021}
}
StyleMeUp: Towards Style-Agnostic Sketch-Based Image Retrieval
Aneeshan Sain,
Ayan Kumar Bhunia
,
Yongxin Yang,
Tao Xiang,
Yi-Zhe Song
.
IEEE Conference on Computer Vision and Pattern Recognition (
CVPR
),
2021
Abstract
/
arXiv
/
BibTex
Sketch-based image retrieval (SBIR) is a cross-modal matching problem which is typically
solved by learning a joint embedding space where the semantic content shared between the photo
and sketch modalities is preserved. However, a fundamental challenge in SBIR has been largely
ignored so far, that is, sketches are drawn by humans and considerable style variations exist
between different users. An effective SBIR model needs to explicitly account for this style
diversity, and crucially to generalise to unseen user styles. To this end, a novel
style-agnostic SBIR model is proposed. Different from existing models, a cross-modal
variational
autoencoder (VAE) is employed to explicitly disentangle each sketch into a semantic content
part
shared with the corresponding photo and a style part unique to the sketcher. Importantly, to
make our model dynamically adaptable to any unseen user styles, we propose to meta-train our
cross-modal VAE by adding two style-adaptive components: a set of feature transformation
layers
to its encoder and a regulariser to the disentangled semantic content latent code. With this
meta-learning framework, our model can not only disentangle the cross-modal shared semantic
content for SBIR, but can adapt the disentanglement to any unseen user styles as well, making
the SBIR model truly style-agnostic. Extensive experiments show that our style-agnostic model
yields state-of-the-art performance for both category-level and instance-level SBIR.
@InProceedings{stylemeup,
author = {Aneeshan Sain and Ayan Kumar Bhunia and Yongxin Yang and Tao Xiang and Yi-Zhe
Song},
title = {StyleMeUp: Towards Style-Agnostic Sketch-Based Image Retrieval},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2021}
}
MetaHTR: Towards Writer-Adaptive Handwritten Text Recognition
Ayan Kumar Bhunia
, Shuvozit Ghose, Amandeep Kumar,
Pinaki Nath Chowdhury, Aneeshan Sain,
Yi-Zhe Song
.
IEEE Conference on Computer Vision and Pattern Recognition (
CVPR
),
2021
Abstract
/
arXiv
/
BibTex
Handwritten Text Recognition (HTR) remains a challenging problem to date, largely due to the
varying writing styles that exist amongst us. Prior works however generally operate with the
assumption that there is a limited number of styles, most of which have already been captured
by
existing datasets. In this paper, we take a completely different perspective -- we work on the
assumption that there is always a new style that is drastically different, and that we will
only
have very limited data during testing to perform adaptation. This results in a commercially
viable solution -- the model has the best shot at adaptation being exposed to the new style,
and
the few samples nature makes it practical to implement. We achieve this via a novel
meta-learning framework which exploits additional new-writer data through a support set, and
outputs a writer-adapted model via single gradient step update, all during inference. We
discover and leverage the important insight that there exist a few key characters per writer
that exhibit relatively larger style discrepancies. For that, we additionally propose to
meta-learn instance specific weights for a character-wise cross-entropy loss, which is
specifically designed to work with the sequential nature of text data. Our writer-adaptive
MetaHTR framework can be easily implemented on top of most state-of-the-art HTR models.
Experiments show an average performance gain of 5-7% can be obtained by observing very few new
style data. We further demonstrate via a set of ablative studies the advantage of our meta
design when compared with alternative adaption mechanisms.
@InProceedings{metahtr,
author = {Ayan Kumar Bhunia and Shuvozit Ghose and Amandeep Kumar and Pinaki Nath Chowdhury and
Aneeshan Sain and Yi-Zhe Song},
title = {MetaHTR: Towards Writer-Adaptive Handwritten Text Recognition},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2021}
}
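The writer-adaptation step above boils down to a single gradient update on the support set, using a character-wise cross-entropy whose per-character weights come from a meta-learned module. A compressed sketch with hypothetical names; the real framework meta-learns the weighting network jointly with the recogniser.

import copy
import torch
import torch.nn.functional as F

def writer_adapt(htr_model, weight_net, support_imgs, support_labels, inner_lr=1e-3):
    """support_labels: (B, T) character indices; htr_model(imgs) -> (B, T, C) logits;
    weight_net maps (B, T, C) logits to (B, T) instance-specific loss weights (assumed)."""
    adapted = copy.deepcopy(htr_model)
    opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
    logits = adapted(support_imgs)
    ce = F.cross_entropy(logits.flatten(0, 1), support_labels.flatten(),
                         reduction="none").view_as(support_labels)   # character-wise CE, (B, T)
    w = weight_net(logits.detach())                                   # meta-learned weights, (B, T)
    loss = (w * ce).mean()
    opt.zero_grad(); loss.backward(); opt.step()                      # single gradient-step update
    return adapted                                                    # writer-adapted HTR model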
Pixelor: A Competitive Sketching AI Agent. So you think you can beat me?
Ayan Kumar Bhunia*
, Ayan Das*, Umar Riaz Muhammad*,
Yongxin Yang, Timothy M. Hospedales,
Tao Xiang, Yulia Gryaditskaya, Yi-Zhe
Song
.
SIGGRAPH Asia
, 2020.
Abstract
/
Code
/
arXiv
/
BibTex
/
Try Online Demo
(*equal contribution)
We present the first competitive drawing agent Pixelor that exhibits human-level
performance
at a Pictionary-like sketching game, where the participant whose sketch is recognized first is
a
winner. Our AI agent can autonomously sketch a given visual concept, and achieve a
recognizable
rendition as quickly or faster than a human competitor. The key to victory for the agent is to
learn the optimal stroke sequencing strategies that generate the most recognizable and
distinguishable strokes first. Training Pixelor is done in two steps. First, we infer the
optimal stroke order that maximizes early recognizability of human training sketches. Second,
this order is used to supervise the training of a sequence-to-sequence stroke generator. Our
key
technical contributions are a tractable search of the exponential space of orderings using
neural sorting; and an improved Seq2Seq Wasserstein (S2S-WAE) generator that uses an
optimal-transport loss to accommodate the multi-modal nature of the optimal stroke
distribution.
Our analysis shows that Pixelor is better than the human players of the Quick, Draw! game,
under
both AI and human judging of early recognition. To analyze the impact of human competitors’
strategies, we conducted a further human study with participants being given unlimited
thinking
time and training in early recognizability by feedback from an AI judge. The study shows that
humans do gradually improve their strategies with training, but overall Pixelor still matches
human performance. We will release the code and the dataset, optimized for the task of early
recognition, upon acceptance.
@InProceedings{sketchxpixelor,
author = {Ayan Kumar Bhunia and Ayan Das and Umar Riaz Muhammad and Yongxin Yang and Timothy M.
Hospedales and Tao Xiang and Yulia Gryaditskaya and Yi-Zhe Song},
title = {Pixelor: A Competitive Sketching AI Agent. So you think you can beat me?},
booktitle = {Siggraph Asia},
month = {November},
year = {2020}
}
Cross-Modal Hierarchical Modelling for Fine-Grained Sketch Based Image Retrieval
Aneeshan Sain,
Ayan Kumar Bhunia
, Yongxin Yang,
Tao Xiang, Yi-Zhe Song
.
British Machine Vision Conference (
BMVC
), 2020.
Abstract
/
arXiv
/
BibTex
(Oral Presentation)
Sketch as an image search query is an ideal alternative to text in capturing the
fine-grained
visual details. Prior successes on fine-grained sketch-based image retrieval (FG-SBIR) have
demonstrated the importance of tackling the unique traits of sketches as opposed to photos,
e.g., temporal vs. static, strokes vs. pixels, and abstract vs. pixel-perfect. In this paper,
we
study a further trait of sketches that has been overlooked to date, that is, they are
hierarchical in terms of the levels of detail -- a person typically sketches up to various
extents of detail to depict an object. This hierarchical structure is often visually distinct.
In this paper, we design a novel network that is capable of cultivating sketch-specific
hierarchies and exploiting them to match sketch with photo at corresponding hierarchical
levels.
In particular, features from a sketch and a photo are enriched using cross-modal co-attention,
coupled with hierarchical node fusion at every level to form a better embedding space to
conduct
retrieval. Experiments on common benchmarks show our method to outperform state-of-the-arts by
a
significant margin.
@InProceedings{sain2020crossmodal,
author = {Aneeshan Sain and Ayan Kumar Bhunia and Yongxin Yang and Tao Xiang and Yi-Zhe
Song},
title = {Cross-Modal Hierarchical Modelling for Fine-Grained Sketch Based Image Retrieval},
booktitle = {BMVC},
month = {September},
year = {2020}
}
Fine-grained visual classification via progressive multi-granularity training of
jigsaw patches
Ruoyi Du, Dongliang Chang,
Ayan Kumar Bhunia
, Jiyang Xie,
Zhanyu Ma, Yi-Zhe Song
, Jun Guo
.
European Conference on Computer Vision (
ECCV
), 2020.
Abstract
/
Code
/
arXiv
/
BibTex
Fine-grained visual classification (FGVC) is much more challenging than traditional
classification tasks due to the inherently subtle intra-class object variations. Recent works
are
mainly part-driven (either explicitly or implicitly), with the assumption that fine-grained
information naturally rests within the parts. In this paper, we take a different stance, and
show that part operations are not strictly necessary -- the key lies with encouraging the
network to learn at different granularities and progressively fusing multi-granularity
features
together. In particular, we propose: (i) a progressive training strategy that effectively
fuses
features from different granularities, and (ii) a random jigsaw patch generator that
encourages
the network to learn features at specific granularities. We evaluate on several standard FGVC
benchmark datasets, and show the proposed method consistently outperforms existing
alternatives
or delivers competitive results.
@InProceedings{du2020fine,
author = {Du, Ruoyi and Chang, Dongliang and Bhunia, Ayan Kumar and Xie, Jiyang and Song, Yi-Zhe
and Ma, Zhanyu and Guo, Jun},
title = {Fine-grained visual classification via progressive multi-granularity training of jigsaw
patches},
booktitle = {ECCV},
month = {August},
year = {2020}
}
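The random jigsaw patch generator is the most self-contained piece of the method above: split each image into an n x n grid and randomly permute the patches, with n set per training stage to control granularity. A plain PyTorch sketch:

import torch

def jigsaw_shuffle(images, n):
    """images: (B, C, H, W) with H and W divisible by n; returns images whose n*n patches
    are randomly permuted, so only structure within a patch survives."""
    b, c, h, w = images.shape
    ph, pw = h // n, w // n
    patches = images.unfold(2, ph, ph).unfold(3, pw, pw)       # (B, C, n, n, ph, pw)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, n * n, c, ph, pw)
    perm = torch.randperm(n * n, device=images.device)
    patches = patches[:, perm]                                  # shuffle the patch order
    patches = patches.reshape(b, n, n, c, ph, pw).permute(0, 3, 1, 4, 2, 5)
    return patches.reshape(b, c, h, w)                          # stitched jigsaw image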
Sketch Less for More: On-the-Fly Fine-Grained Sketch Based Image Retrieval
Ayan Kumar Bhunia
, Yongxin Yang, Timothy M. Hospedales,
Tao Xiang, Yi-Zhe Song.
IEEE Conference on Computer Vision and Pattern Recognition (
CVPR
), 2020.
Abstract
/
Code
/
arXiv
/
BibTex
(Oral Presentation)
Fine-grained sketch-based image retrieval (FG-SBIR) addresses the problem of retrieving a
particular photo instance given a user's query sketch. Its widespread applicability is however
hindered by the fact that drawing a sketch takes time, and most people struggle to draw a
complete and faithful sketch. In this paper, we reformulate the conventional FG-SBIR framework
to tackle these challenges, with the ultimate goal of retrieving the target photo with the
least
number of strokes possible. We further propose an on-the-fly design that starts retrieving as
soon as the user starts drawing. To accomplish this, we devise a reinforcement learning based
cross-modal retrieval framework that directly optimizes rank of the ground-truth photo over a
complete sketch drawing episode. Additionally, we introduce a novel reward scheme that
circumvents the problems related to irrelevant sketch strokes, and thus provides us with a
more
consistent rank list during the retrieval. We achieve superior early-retrieval efficiency over
state-of-the-art methods and alternative baselines on two publicly available fine-grained
sketch
retrieval datasets.
@InProceedings{bhunia2020sketch,
author = {Ayan Kumar Bhunia and Yongxin Yang and Timothy M. Hospedales and Tao Xiang and Yi-Zhe
Song},
title = {Sketch Less for More: On-the-Fly Fine-Grained Sketch Based Image Retrieval},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2020}
}
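The on-the-fly objective above can be pictured as a reward computed after every new stroke, based on where the true photo ranks in the retrieval list at that moment. A hedged helper for such per-step rank rewards is sketched below (hypothetical interfaces, assuming a trained joint embedding model); the actual method optimises the embedding with RL using rewards of this kind.

import torch
import torch.nn.functional as F

def episode_rank_rewards(encoder, partial_sketches, true_photo, gallery):
    """partial_sketches: progressively completed renderings of one sketch, each (1, C, H, W);
    true_photo: (1, C, H, W); gallery: (N, C, H, W). Returns one reward per drawing step."""
    rewards = []
    with torch.no_grad():
        g = F.normalize(encoder(gallery), dim=-1)            # (N, D) photo embeddings
        t = F.normalize(encoder(true_photo), dim=-1)         # (1, D)
        for sk in partial_sketches:
            q = F.normalize(encoder(sk), dim=-1)             # embedding of the partial sketch
            sims = q @ g.t()                                  # similarity to every gallery photo
            true_sim = (q @ t.t()).item()
            rank = int((sims > true_sim).sum().item()) + 1    # 1 means retrieved first
            rewards.append(1.0 / rank)                        # early, high ranks earn more reward
    return rewards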
Handwriting Recognition in Low-Resource Scripts Using Adversarial Learning
Ayan Kumar Bhunia
, Abhirup Das, Ankan Kumar Bhunia,
Perla Sai Raj Kishore, Partha Pratim Roy.
IEEE Conference on Computer Vision and Pattern Recognition (
CVPR
), 2019
Abstract
/
Code
/
arXiv
/
BibTex
Handwritten Word Recognition and Spotting is a challenging field dealing with handwritten text
possessing irregular and complex shapes. The design of deep neural network models makes it
necessary to extend training datasets in order to introduce variations and increase the number
of samples; word-retrieval is therefore very difficult in low-resource scripts. Much of the
existing literature comprises preprocessing strategies which are seldom sufficient to cover
all
possible variations. We propose an Adversarial Feature Deformation Module (AFDM) that learns
ways to elastically warp extracted features in a scalable manner. The AFDM is inserted between
intermediate layers and trained alternatively with the original framework, boosting its
capability to better learn highly informative features rather than trivial ones. We test our
meta-framework, which is built on top of popular word-spotting and word-recognition frameworks
and enhanced by AFDM, not only on extensive Latin word datasets but also on sparser Indic
scripts. We record results for varying sizes of training data, and observe that our enhanced
network generalizes much better in the low-data regime; the overall word-error rates and mAP
scores are observed to improve as well.
@InProceedings{Bhunia_2019_CVPR,
author = {Bhunia, Ayan Kumar and Das, Abhirup and Bhunia, Ankan Kumar and Kishore, Perla Sai Raj
and Roy, Partha Pratim},
title = {Handwriting Recognition in Low-Resource Scripts Using Adversarial Learning},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2019}
}
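The Adversarial Feature Deformation Module can be pictured as a small network that warps intermediate feature maps and is trained to maximise the recogniser's loss while the recogniser minimises it. The sketch below uses a simple affine warp as a stand-in for the paper's deformation and is purely illustrative:

import torch
import torch.nn as nn
import torch.nn.functional as F

class AffineFeatureDeformer(nn.Module):
    """Predicts a per-sample affine warp and applies it to intermediate feature maps."""
    def __init__(self, channels):
        super().__init__()
        self.theta_net = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, 6))                   # 2x3 affine parameters per sample

    def forward(self, feats):                         # feats: (B, C, H, W)
        theta = self.theta_net(feats).view(-1, 2, 3)
        grid = F.affine_grid(theta, feats.size(), align_corners=False)
        return F.grid_sample(feats, grid, align_corners=False)

# Alternating training (illustrative): the deformer is updated to increase the word-recognition
# loss on warped features, while the recogniser is updated to decrease it, pushing the recogniser
# towards features that survive elastic-style deformations.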
Improving Document Binarization via Adversarial Noise-Texture Augmentation
Ankan Kumar Bhunia,
Ayan Kumar Bhunia
,
Aneeshan Sain, Partha Pratim Roy.
IEEE Conference on Image Processing (
ICIP
), 2019
Abstract
/
Code
/
arXiv
/
BibTex
(Top 10% Papers)
Binarization of degraded document images is an elementary step in most of the problems in
the document image analysis domain. This paper revisits the binarization problem by introducing an
adversarial learning approach. We construct a Texture Augmentation Network that transfers the
texture element of a degraded reference document image to a clean binary image. In this way,
the
network creates multiple versions of the same textual content with various noisy textures,
thus
enlarging the available document binarization datasets. Finally, the newly generated images
are
passed through a Binarization network to get back the clean version. By jointly training the
two
networks we can increase the adversarial robustness of our system. Also, it is noteworthy that
our model can learn from unpaired data. Experimental results suggest that the proposed method
achieves superior performance over widely used DIBCO datasets.
@inproceedings{bhunia2019improving,
title={Improving document binarization via adversarial noise-texture augmentation},
author={Bhunia, Ankan Kumar and Bhunia, Ayan Kumar and Sain, Aneeshan and Roy, Partha Pratim},
booktitle={2019 IEEE International Conference on Image Processing (ICIP)},
pages={2721--2725},
year={2019},
organization={IEEE}
}
A Deep One-Shot Network for Query-based Logo Retrieval
Ayan Kumar Bhunia
, Ankan Kumar Bhunia,
Shuvozit Ghose, Abhirup Das, Partha
Pratim
Roy, Umapada Pal
Pattern Recognition (
PR
), 2019
Abstract
/
Code
/
Third Party Implementation
/
arXiv
/
BibTex
Logo detection in real-world scene images is an important problem with applications in
advertisement and marketing. Existing general-purpose object detection methods require large
training data with annotations for every logo class. These methods do not satisfy the
incremental demand of logo classes necessary for practical deployment since it is practically
impossible to have such annotated data for new unseen logos. In this work, we develop an
easy-to-implement query-based logo detection and localization system by employing a one-shot
learning technique using off-the-shelf neural network components. Given an image of a query logo, our model searches for the logo within a given target image and predicts the possible
location
of the logo by estimating a binary segmentation mask. The proposed model consists of a
conditional branch and a segmentation branch. The former gives a conditional latent
representation of the given query logo which is combined with feature maps of the segmentation
branch at multiple scales in order to obtain the matching location of the query logo in a
target
image. Feature matching between the latent query representation and multi-scale feature maps
of
segmentation branch using simple concatenation operation followed by 1 × 1 convolution layer
makes our model scale-invariant. Despite its simplicity, our query-based logo retrieval
framework achieved superior performance on the FlickrLogos-32 and TopLogos-10 datasets over
different
existing baseline methods.
@article{bhunia2019deep,
title={A Deep One-Shot Network for Query-based Logo Retrieval},
author={Bhunia, Ayan Kumar and Bhunia, Ankan Kumar and Ghose, Shuvozit and Das, Abhirup and Roy,
Partha Pratim and Pal, Umapada},
journal={Pattern Recognition},
pages={106965},
year={2019},
publisher={Elsevier}}
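The conditioning mechanism described above (tile the query-logo latent vector over the spatial grid, concatenate it with a segmentation-branch feature map, and fuse with a 1 x 1 convolution) is easy to sketch; the shapes and module name below are hypothetical.

import torch
import torch.nn as nn

class QueryConditionedFusion(nn.Module):
    """Fuses the query-logo latent representation with one scale of the segmentation branch."""
    def __init__(self, feat_channels, query_dim):
        super().__init__()
        self.fuse = nn.Conv2d(feat_channels + query_dim, feat_channels, kernel_size=1)

    def forward(self, feat_map, query_latent):
        # feat_map: (B, C, H, W); query_latent: (B, Q) from the conditional branch
        b, _, h, w = feat_map.shape
        q = query_latent.view(b, -1, 1, 1).expand(-1, -1, h, w)   # tile over the spatial grid
        return self.fuse(torch.cat([feat_map, q], dim=1))          # 1x1 conv after concatenation

# Applying this at multiple scales before the mask head is what gives the matching its
# scale invariance in the description above.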
User Constrained Thumbnail Generation Using Adaptive Convolutions
Perla Sai Raj Kishore,
Ayan Kumar Bhunia
, Shuvozit Ghose,
Partha Pratim Roy
International Conference on Acoustics, Speech and Signal Processing (
ICASSP
), 2019
Abstract
/
Code
/
arXiv
/
BibTex
(Oral Presentation)
Thumbnails are widely used all over the world as a preview for digital images. In this work we
propose a deep neural framework to generate thumbnails of any size and aspect ratio, even for
unseen values during training, with high accuracy and precision. We use Global Context
Aggregation (GCA) and a modified Region Proposal Network (RPN) with adaptive convolutions to
generate thumbnails in real time. GCA is used to selectively attend and aggregate the global
context information from the entire image while the RPN is used to generate candidate bounding
boxes for the thumbnail image. Adaptive convolution eliminates the difficulty of generating
thumbnails of various aspect ratios by using filter weights dynamically generated from the
aspect ratio information. The experimental results indicate the superior performance of the
proposed model over existing state-of-the-art techniques.
@inproceedings{kishore2019user,
title={User Constrained Thumbnail Generation Using Adaptive Convolutions},
author={Kishore, Perla Sai Raj and Bhunia, Ayan Kumar and Ghose, Shuvozit and Roy, Partha
Pratim},
booktitle={ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP)},
pages={1677--1681},
year={2019},
organization={IEEE}
}
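The adaptive-convolution idea above, where filter weights are generated on the fly from the requested aspect ratio, can be sketched with a tiny hyper-network feeding F.conv2d. The sizes and names are assumptions for illustration:

import torch
import torch.nn as nn
import torch.nn.functional as F

class AspectAdaptiveConv(nn.Module):
    """Generates k x k convolution filters from the target thumbnail aspect ratio."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.in_ch, self.out_ch, self.k = in_ch, out_ch, k
        self.filter_gen = nn.Sequential(              # hyper-network: aspect ratio -> kernel weights
            nn.Linear(1, 64), nn.ReLU(),
            nn.Linear(64, out_ch * in_ch * k * k))

    def forward(self, x, aspect_ratio):
        # x: (B, in_ch, H, W); aspect_ratio: a Python float such as 16 / 9 (shared by the batch here)
        ratio = torch.tensor([[float(aspect_ratio)]], dtype=x.dtype, device=x.device)
        w = self.filter_gen(ratio).view(self.out_ch, self.in_ch, self.k, self.k)
        return F.conv2d(x, w, padding=self.k // 2)    # convolution with dynamically generated filters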
Texture Synthesis Guided Deep Hashing for Texture Image Retrieval
Ayan Kumar Bhunia
,
Perla Sai Raj Kishore,
Pranay Mukherjee,
Abhirup Das,
Partha Pratim Roy
IEEE Winter Conference on Applications of Computer Vision (
WACV
), 2019
Abstract
/
arXiv
/
BibTex
/
Video Presentation
With the large scale explosion of images and videos over the internet, efficient hashing
methods
have been developed to facilitate memory and time efficient retrieval of similar images.
However, none of the existing works use hashing to address texture image retrieval mostly
because of the lack of sufficiently large texture image databases. Our work addresses this
problem by developing a novel deep learning architecture that generates binary hash codes for
input texture images. For this, we first pre-train a Texture Synthesis Network (TSN) which
takes
a texture patch as input and outputs an enlarged view of the texture by injecting newer
texture
content. Thus it signifies that the TSN encodes the learnt texture specific information in its
intermediate layers. In the next stage, a second network gathers the multi-scale feature
representations from the TSN’s intermediate layers using channel-wise attention, combines them
in a progressive manner to a dense continuous representation which is finally converted into a
binary hash code with the help of individual and pairwise label information. The new enlarged
texture patches from the TSN also help in data augmentation to alleviate the problem of
insufficient texture data and are used to train the second stage of the network. Experiments
on
three public texture image retrieval datasets indicate the superiority of our texture
synthesis
guided hashing approach over existing state-of-the-art methods.
@inproceedings{bhunia2019texture,
title={Texture synthesis guided deep hashing for texture image retrieval},
author={Bhunia, Ayan Kumar and Perla, Sai Raj Kishore and Mukherjee, Pranay and Das, Abhirup and
Roy, Partha Pratim},
booktitle={2019 IEEE Winter Conference on Applications of Computer Vision (WACV)},
pages={609--618},
year={2019},
organization={IEEE}
}
Script identification in natural scene image and video frames using an attention
based Convolutional-LSTM network
Ankan Kumar Bhunia, Aishik Konwer,
Ayan Kumar Bhunia
,
Abir Bhowmick, Partha Pratim Roy, Umapada
Pal
Pattern Recognition (
PR
), 2019
Abstract
/
Code
/
arXiv
/
BibTex
Script identification plays a significant role in analysing documents and videos. In this
paper,
we focus on the problem of script identification in scene text images and video scripts.
Because
of low image quality, complex background and similar layout of characters shared by some
scripts
like Greek, Latin, etc., text recognition in those cases becomes challenging. In this paper, we
propose a novel method that involves extraction of local and global features using CNN-LSTM
framework and weighting them dynamically for script identification. First, we convert the
images
into patches and feed them into a CNN-LSTM framework. Attention-based patch weights are
calculated applying softmax layer after LSTM. Next, we do patch-wise multiplication of these
weights with the corresponding CNN features to yield local features. Global features are also extracted
from
last cell state of LSTM. We employ a fusion technique which dynamically weights the local and
global features for an individual patch. Experiments have been done on four public script
identification datasets: SIW-13, CVSI2015, ICDAR-17 and MLe2e. The proposed framework achieves
superior results in comparison to conventional methods.
@article{bhunia2019script,
title={Script identification in natural scene image and video frames using an attention based
convolutional-LSTM network},
author={Bhunia, Ankan Kumar and Konwer, Aishik and Bhunia, Ayan Kumar and Bhowmick, Abir and
Roy,
Partha P and Pal, Umapada},
journal={Pattern Recognition},
volume={85},
pages={172--184},
year={2019},
publisher={Elsevier}
}
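The dynamic local/global weighting described above reduces to a softmax attention over per-patch recurrent outputs, combined with a global feature from the final LSTM state. A condensed, illustrative re-implementation with hypothetical dimensions:

import torch
import torch.nn as nn

class PatchAttentionFusion(nn.Module):
    """Attention over per-patch CNN features plus a global LSTM feature, then classification."""
    def __init__(self, feat_dim=256, hidden=128, num_scripts=13):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.score = nn.Linear(hidden, 1)
        self.classifier = nn.Linear(feat_dim + hidden, num_scripts)

    def forward(self, patch_feats):                    # (B, num_patches, feat_dim) from a CNN
        out, (h, c) = self.lstm(patch_feats)           # out: (B, P, hidden)
        attn = torch.softmax(self.score(out), dim=1)   # patch-wise attention weights
        local = (attn * patch_feats).sum(dim=1)        # attention-weighted local features
        global_feat = c[-1]                            # global feature from the last cell state
        return self.classifier(torch.cat([local, global_feat], dim=-1))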