Ayan Kumar Bhunia
I completed my Doctor of Philosophy (PhD), focusing on Computer Vision and Deep Learning, at the SketchX Lab, Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey, England, United Kingdom, under the supervision of Prof. Yi-Zhe Song and Prof. Tao (Tony) Xiang.
Prior to that, I worked as a full-time research assistant at the Institute for Media Innovation (IMI) Lab of Nanyang Technological University (NTU), Singapore.
Currently, I am working as a Senior Research Scientist (Computer Vision) at iSIZE, a London-based deep-tech company specializing in deep learning for video delivery.
Top-venue conference publications (as of June 2022): 10x CVPR, 3x ICCV, 3x ECCV, 1x SIGGRAPH Asia.
Google Scholar / GitHub / LinkedIn / DBLP
Recent Updates
New! [Oct 2022]: Defended my PhD Thesis before Prof. Stella Yu and Prof. Adrian Hilton, with no corrections.
New! [July 2022]: Two papers accepted at ECCV 2022.
New! [March 2022]: Four papers accepted at CVPR 2022.
[July 2021]: Three papers accepted at ICCV 2021.
[June 2021]: Gave a talk on 'Beyond Supervised Sketch Representation Learning' (YouTube).
[March 2021]: Four papers accepted at CVPR 2021.
[Aug 2020]: One paper accepted at SIGGRAPH Asia 2020. Check the Online Demo.
[Aug 2020]: One paper accepted at BMVC 2020 for oral presentation.
[July 2020]: One paper accepted at ECCV 2020.
[March 2020]: One paper accepted at CVPR 2020 for oral presentation.
Notes:
If you are interested in potential research collaboration, feel free to contact me by email or LinkedIn. I am especially happy to collaborate with self-motivated and enthusiastic undergraduate or postgraduate students who intend to pursue an MS/PhD in the future.
Research Interests
My research broadly centres on Computer Vision and Deep Learning. Within computer vision, I have explored three specific topics.
a) Sketch for Visual Understanding:
Hand-drawn sketches by nature inherit the cognitive potential of human intelligence, which makes them a natural medium for various visual understanding tasks. During my PhD, my research centred on how sketches can be leveraged to address different visual understanding problems. For instance, I extended traditional sketch-based image retrieval (SBIR) to an on-the-fly retrieval setup, where the system starts retrieving as soon as the user starts drawing. In addition, I explored Annotation-Efficient Learning under low-resource data scenarios, including a semi-supervised framework for cross-modal instance-level retrieval and self-supervised learning on sparse image data such as sketch and handwriting. With the recent proliferation of touch-screen devices, sketch has become a promising medium for interacting with digital systems, thanks to the fine-grained, personalised control it offers. Looking ahead, I am keen to explore how 2D sketches can facilitate creative image generation/editing with 3D perception.
b) Document Image Analysis and Text Recognition:
Over the last few years, I have worked on various problems in Document Analysis and Recognition. A few representative works include MetaHTR (a writer-adaptive Handwritten Text Recognition system), Unifying Handwriting and Scene Text Recognition, Unsupervised Document Image Binarization, and Script Identification, among others.
c) Image/Video Restoration:
At my current company, iSIZE, I work on developing state-of-the-art Image/Video Restoration solutions. During my ongoing tenure at iSIZE, I designed and developed, from the ground up, the deep model behind the product BitClear, which provides a low-cost neural solution for removing compressed-video artefacts. BitClear won the NAB Product of the Year award (2022) in the AI/ML category; the NAB Show is the largest trade show for media, entertainment and technology.
Moreover, from an industry perspective, these are the subtopics I have explored so far:
(i) Cross-modal Image Retrieval, (ii) Generative Adversarial Networks, (iii) Meta-Learning, (iv) Self-supervised Learning, (v) Image/Video Denoising, (vi) Text Recognition, (vii) Image-to-Sequence Generation, (viii) RL for Vision, (ix) Semi-supervised Learning, (x) Fine-grained Visual Recognition, (xi) Knowledge Distillation, (xii) Incremental Learning.
FS-COCO: Towards Understanding of Freehand Sketches of Common Objects in Context
Pinaki Nath Chowdhury, Aneeshan Sain,
Yulia Gryaditskaya,
Ayan Kumar Bhunia
,
Tao
Xiang
, Yi-Zhe Song
.
European Conference on Computer Vision (
ECCV
),
2022 (New!)
Abstract
/
Code
/
arXiv
/
BibTex
We advance sketch research to scenes with the first dataset of freehand scene sketches,
FS-COCO. With practical applications in mind, we collect sketches that convey scene content well but can be sketched within a few minutes by a person with any level of sketching skill. Our dataset comprises 10,000 freehand scene vector sketches with per-point space-time information, drawn by 100 non-expert individuals, offering both object- and scene-level abstraction. Each sketch
is augmented with its text description. Using our dataset, we study for the first time the
problem of fine-grained image retrieval from freehand scene sketches and sketch captions. We
draw insights on: (i) Scene salience encoded in sketches via the strokes' temporal order;
(ii) Performance comparison of image retrieval from a scene sketch and an image caption; (iii)
Complementarity of information in sketches and image captions, as well as the potential
benefit of combining the two modalities. In addition, we extend a popular vector sketch
LSTM-based encoder to handle sketches with larger complexity than was supported by previous
work. Namely, we propose a hierarchical sketch decoder, which we leverage in a sketch-specific “pre-text” task. Our dataset enables, for the first time, research on freehand scene sketch understanding and its practical applications.
@InProceedings{FSCOCO,
author = {Pinaki Nath Chowdhury and Aneeshan Sain and Yulia Gryaditskaya and Ayan Kumar Bhunia
and Tao Xiang and Yi-Zhe Song},
title = {FS-COCO: Towards Understanding of Freehand Sketches
of Common Objects in Context},
booktitle = {ECCV},
month = {October},
year = {2022}
}
Adaptive Fine-Grained Sketch-Based Image Retrieval
Ayan Kumar Bhunia
, Aneeshan Sain, Parth Hiren Shah,
Animesh Gupta, Pinaki Nath Chowdhury, Tao
Xiang
, Yi-Zhe Song
.
European Conference on Computer Vision (
ECCV
),
2022 (New!)
Abstract
/
Code
/
arXiv
/
BibTex
The recent focus on Fine-Grained Sketch-Based Image Retrieval (FG-SBIR) has shifted towards
generalising a model to new categories without any training data from them. In real-world
applications, however, a trained FG-SBIR model is often applied to both new categories and
different human sketchers, i.e., different drawing styles. Although this complicates the
generalisation problem, fortunately, a handful of examples are typically available, enabling
the model to adapt to the new category/style. In this paper, we offer a novel perspective --
instead of asking for a model that generalises, we advocate for one that quickly adapts, with
just very few samples during testing (in a few-shot manner). To solve this new problem, we
introduce a novel model-agnostic meta-learning (MAML) based framework with several key
modifications: (1) As a retrieval task with a margin-based contrastive loss, we simplify the
MAML training in the inner loop to make it more stable and tractable. (2) The margin in our
contrastive loss is also meta-learned with the rest of the model. (3) Three additional
regularisation losses are introduced in the outer loop, to make the meta-learned FG-SBIR model
more effective for category/style adaptation. Extensive experiments on public datasets suggest
a large gain over generalisation and zero-shot based approaches, and a few strong few-shot
baselines.
@InProceedings{adaptivefgsbir,
author = {Ayan Kumar Bhunia and Aneeshan Sain and Parth Hiren Shah and Animesh Gupta and
Pinaki Nath Chowdhury and Tao Xiang and Yi-Zhe Song},
title = {Adaptive Fine-Grained Sketch-Based Image Retrieval},
booktitle = {ECCV},
month = {October},
year = {2022}
}
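The adapt-rather-than-generalise idea above can be pictured as a few inner-loop gradient steps on the handful of support sketch-photo pairs, using a triplet loss whose margin was itself meta-learned. Below is a minimal PyTorch-style sketch under that reading; the function and variable names are hypothetical and this is not the authors' released code.

import copy
import torch
import torch.nn.functional as F

def adapt_fgsbir(encoder, margin, support_sketches, support_photos, inner_lr=0.01, steps=1):
    """Test-time adaptation of a meta-trained FG-SBIR encoder (illustrative only).
    support_sketches, support_photos: a few paired examples from the new category/style."""
    fast = copy.deepcopy(encoder)                      # keep the meta-trained weights intact
    opt = torch.optim.SGD(fast.parameters(), lr=inner_lr)
    for _ in range(steps):
        s = F.normalize(fast(support_sketches), dim=-1)    # (K, D) sketch embeddings
        p = F.normalize(fast(support_photos), dim=-1)      # (K, D) photo embeddings
        dist = torch.cdist(s, p)                            # pairwise sketch-photo distances
        pos = dist.diag()                                   # matched pairs
        neg = (dist + 1e6 * torch.eye(len(s), device=dist.device)).min(dim=1).values
        # margin-based triplet loss; the margin value comes from outer-loop meta-learning
        loss = F.relu(pos - neg + margin).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return fast                                             # category/style-adapted retrieval model

# Usage (hypothetical): adapted = adapt_fgsbir(encoder, float(meta_margin), sk, ph); then embed
# the gallery and query sketches with `adapted` and rank photos by cosine similarity.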
Doodle It Yourself: Class Incremental Learning by Drawing a Few Sketches
Ayan Kumar Bhunia
, Viswanatha Reddy Gajjala, Subhadeep Koley,
Rohit Kundu, Aneeshan Sain, Tao Xiang
, Yi-Zhe Song
.
IEEE Conference on Computer Vision and Pattern Recognition (
CVPR
),
2022 (New!)
Abstract
/
Code
/
arXiv
/
Marktechpost Blog
/
BibTex
The human visual system is remarkable in learning new visual concepts from just a few
examples. This is precisely the goal behind few-shot class incremental learning (FSCIL), where
the emphasis is additionally placed on ensuring the model does not suffer from ``forgetting''.
In this paper, we push the boundary further for FSCIL by addressing two key questions that
bottleneck its ubiquitous application (i) can the model learn from diverse modalities other
than just photo (as humans do), and (ii) what if photos are not readily accessible (due to
ethical and privacy constraints). Our key innovation lies in advocating the use of sketches as
a new modality for class support. The product is a ``Doodle It Yourself" (DIY) FSCIL framework
where the users can freely sketch a few examples of a novel class for the model to learn to
recognize photos of that class. For that, we present a framework that infuses (i) gradient
consensus for domain invariant learning, (ii) knowledge distillation for preserving old class
information, and (iii) graph attention networks for message passing between old and novel
classes. We experimentally show that sketches are better class support than text in the
context of FSCIL, echoing findings elsewhere in the sketching literature.
@InProceedings{DoodleIncremental,
author = {Ayan Kumar Bhunia and Viswanatha Reddy Gajjala and Subhadeep Koley and Rohit Kundu and
Aneeshan Sain and Tao Xiang and Yi-Zhe Song},
title = {Doodle It Yourself: Class Incremental Learning by Drawing a Few Sketches},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2022}
}
Sketching without Worrying: Noise-Tolerant Sketch-Based Image Retrieval
Ayan Kumar Bhunia
, Subhadeep Koley, Abdullah Faiz Ur Rahman Khilji
, Aneeshan Sain, Pinaki Nath Chowdhury, Tao Xiang, Yi-Zhe Song
.
IEEE Conference on Computer Vision and Pattern Recognition (
CVPR
),
2022 (New!)
Abstract
/
Code
/
arXiv
/
BibTex
Sketching enables many exciting applications, notably, image retrieval. The
fear-to-sketch problem (i.e., ``I can't sketch") has however proven to be fatal for its
widespread adoption. This paper tackles this ``fear" head on, and for the first time,
proposes an auxiliary module for existing retrieval models that predominantly lets the users
sketch without having to worry. We first conducted a pilot study that revealed the secret
lies in the existence of noisy strokes, but not so much of the ``I can't sketch". We
consequently design a stroke subset selector that detects noisy strokes, leaving only those which make a positive contribution towards successful retrieval. Our Reinforcement
Learning based formulation quantifies the importance of each stroke present in a given
subset, based on the extent to which that stroke contributes to retrieval. When combined
with pre-trained retrieval models as a pre-processing module, we achieve a significant gain
of 8\%-10\% over standard baselines and in turn report new state-of-the-art performance.
Last but not least, we demonstrate the selector once trained, can also be used in a
plug-and-play manner to empower various sketch applications in ways that were not previously
possible.
@InProceedings{strokesubset,
author = {Ayan Kumar Bhunia and Subhadeep Koley and Abdullah Faiz Ur Rahman Khilji and
Aneeshan Sain
and Pinaki Nath Chowdhury and Tao Xiang and Yi-Zhe Song},
title = {Sketching without Worrying: Noise-Tolerant Sketch-Based Image Retrieval},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2022}
}
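One way to picture the reinforcement-learning formulation above: a small selector scores each stroke, a stroke subset is sampled, and the retrieval rank of the paired photo acts as the reward for a REINFORCE update. A rough sketch, assuming per-stroke features and a frozen pre-trained retrieval model behind rank_fn; all names are hypothetical.

import torch
import torch.nn as nn

class StrokeSelector(nn.Module):
    """Predicts a keep-probability for every stroke of a sketch."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, stroke_feats):                    # (num_strokes, feat_dim)
        return torch.sigmoid(self.net(stroke_feats)).squeeze(-1)

def reinforce_step(selector, optimizer, stroke_feats, rank_fn):
    """rank_fn(keep_mask) -> rank of the true photo when retrieving with only the kept strokes."""
    probs = selector(stroke_feats)
    dist = torch.distributions.Bernoulli(probs)
    keep = dist.sample()                                # sampled stroke subset
    reward = -float(rank_fn(keep))                      # lower rank (earlier hit) => higher reward
    loss = -(dist.log_prob(keep).sum() * reward)        # score-function (REINFORCE) gradient
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return keep, reward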
Partially Does It: Towards Scene-Level FG-SBIR with Partial Input
Pinaki Nath Chowdhury,
Ayan Kumar Bhunia
, Viswanatha Reddy Gajjala, Aneeshan Sain,
Tao Xiang, Yi-Zhe Song
.
IEEE Conference on Computer Vision and Pattern Recognition (
CVPR
),
2022 (New!)
Abstract
/
Code
/
arXiv
/
BibTex
We scrutinise an important observation plaguing scene-level sketch research -- that a
significant portion of scene sketches are ``partial". A quick pilot study reveals: (i) a scene
sketch does not necessarily contain all objects in the corresponding photo, due to the
subjective holistic interpretation of scenes, (ii) there exist significant empty (white)
regions as a result of object-level abstraction, and as a result, (iii) existing scene-level
fine-grained sketch-based image retrieval methods collapse as scene sketches become more
partial. To solve this ``partial" problem, we advocate for a simple set-based approach using
optimal transport (OT) to model cross-modal region associativity in a partially-aware fashion.
Importantly, we improve upon OT to further account for holistic partialness by comparing
intra-modal adjacency matrices. Our proposed method is not only robust to partial
scene-sketches but also yields state-of-the-art performance on existing datasets.
@InProceedings{PartialSBIR,
author = {Pinaki Nath Chowdhury and Ayan Kumar Bhunia and Viswanatha Reddy Gajjala and Aneeshan
Sain and Tao Xiang and Yi-Zhe Song},
title = {Partially Does It: Towards Scene-Level FG-SBIR with Partial Input},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2022}
}
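The partial-matching step above can be read as an entropic optimal-transport problem between sketch-region and photo-region descriptors. Below is a minimal Sinkhorn sketch of that reading only; the paper's actual formulation additionally compares intra-modal adjacency matrices to handle holistic partialness.

import torch

def sinkhorn_plan(sketch_regions, photo_regions, eps=0.1, iters=50):
    """sketch_regions: (m, d), photo_regions: (n, d). Returns the transport plan and OT cost."""
    cost = torch.cdist(sketch_regions, photo_regions)        # region-to-region matching cost
    K = torch.exp(-cost / eps)                               # Gibbs kernel
    a = torch.full((cost.size(0),), 1.0 / cost.size(0), device=cost.device)  # uniform masses
    b = torch.full((cost.size(1),), 1.0 / cost.size(1), device=cost.device)
    u, v = torch.ones_like(a), torch.ones_like(b)
    for _ in range(iters):                                   # Sinkhorn scaling iterations
        u = a / (K @ v)
        v = b / (K.t() @ u)
    plan = torch.diag(u) @ K @ torch.diag(v)                 # soft region correspondence
    return plan, (plan * cost).sum()                         # the cost can serve as a retrieval distance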
Sketch3T: Test-time Training for Zero-Shot SBIR
Aneeshan Sain,
Ayan Kumar Bhunia
, Vaishnav Potlapalli , Pinaki Nath Chowdhury,
Tao Xiang, Yi-Zhe Song
.
IEEE Conference on Computer Vision and Pattern Recognition (
CVPR
),
2022 (New!)
Abstract
/
Code
/
arXiv
/
BibTex
Zero-shot sketch-based image retrieval typically asks for a trained model to be applied as
is to unseen categories. In this paper, we question this setup, arguing that it is by definition not compatible with the inherent abstract and subjective nature of sketches -- the model might transfer well to new categories, but will not understand sketches drawn from a different test-time distribution. We thus extend ZS-SBIR, asking it to transfer to both
categories and sketch distributions. Our key contribution is a test-time training paradigm
that can adapt using just one sketch. Since there is no paired photo, we make use of a sketch
raster-vector reconstruction module as a self-supervised auxiliary task. To maintain the
fidelity of the trained cross-modal joint embedding during test-time update, we design a novel
meta-learning based training paradigm to learn a separation between model updates incurred by
this auxiliary task from those of the primary objective of discriminative learning. Extensive experiments show our model to outperform state-of-the-arts, thanks to the proposed test-time adaptation that not only transfers to new categories but also accommodates new sketching styles.
@InProceedings{Sketch3T,
author = {Aneeshan Sain and Ayan Kumar Bhunia and Vaishnav Potlapalli and Pinaki Nath Chowdhury
and Tao Xiang and Yi-Zhe Song},
title = {Sketch3T: Test-time Training for Zero-Shot SBIR},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2022}
}
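A simplified picture of the test-time training loop described above: before retrieving, the sketch encoder is updated with a self-supervised reconstruction loss computed on the single query sketch, so no paired photo is needed. The module names and the MSE stand-in loss below are assumptions, and the paper's meta-learned separation of auxiliary and primary updates is omitted.

import copy
import torch
import torch.nn.functional as F

def test_time_adapt(encoder, aux_decoder, sketch_raster, sketch_vector, lr=1e-4, steps=5):
    """Adapt the sketch branch to one test sketch via a raster-to-vector auxiliary task."""
    model = copy.deepcopy(encoder)                   # leave the deployed model untouched
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        feat = model(sketch_raster)                  # embed the rasterised query sketch
        pred_vector = aux_decoder(feat)              # self-supervised reconstruction head
        loss = F.mse_loss(pred_vector, sketch_vector)
        opt.zero_grad(); loss.backward(); opt.step()
    return model                                     # style-adapted encoder used for ZS-SBIR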
Text is Text, No Matter What: Unifying Text Recognition using Knowledge
Distillation
Ayan Kumar Bhunia
, Aneeshan Sain, Pinaki Nath Chowdhury,
Yi-Zhe Song
.
IEEE International Conference on Computer Vision (
ICCV
),
2021
Abstract
/
arXiv
/
BibTex
Text recognition remains a fundamental and extensively researched topic in computer vision,
largely owing to its wide array of commercial applications. The challenging nature of the very
problem however dictated a fragmentation of research efforts: Scene Text Recognition (STR)
that
deals with text in everyday scenes, and Handwriting Text Recognition (HTR) that tackles
hand-written text. In this paper, for the first time, we argue for their unification -- we aim
for a single model that can compete favourably with two separate state-of-the-art STR and HTR
models. We first show that cross-utilisation of STR and HTR models trigger significant
performance drops due to differences in their inherent challenges. We then tackle their union
by
introducing a knowledge distillation (KD) based framework. This is however non-trivial,
largely
due to the variable-length and sequential nature of text sequences, which renders
off-the-shelf
KD techniques that mostly work with global fixed-length data inadequate. For that, we propose
three distillation losses all of which specifically designed to cope with the aforementioned
unique characteristics of text recognition. Empirical evidence suggests that our proposed
unified model performs on par with individual models, even surpassing them in certain cases.
Ablative studies demonstrate that naive baselines such as a two-stage framework, and domain
adaption/generalisation alternatives do not work as well, further verifying the
appropriateness
of our design.
@InProceedings{textistext,
author = {Ayan Kumar Bhunia and Aneeshan Sain and Pinaki Nath Chowdhury and Yi-Zhe Song},
title = {Text is Text, No Matter What: Unifying Text Recognition using Knowledge
Distillation},
booktitle = {The IEEE International Conference on Computer Vision (ICCV)},
month = {October},
year = {2021}
}
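The distillation idea above can be illustrated with a generic sequence-level KD term: at every decoding step, the unified student matches the character distribution of the specialist STR or HTR teacher. This is a standard temperature-scaled KL sketch, not the paper's three specific losses.

import torch.nn.functional as F

def sequence_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """student_logits, teacher_logits: (batch, seq_len, num_chars) from the decoders."""
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    # KL between per-character distributions at each time step; t*t keeps the gradient scale
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)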
Towards the Unseen: Iterative Text Recognition by Distilling from Errors
Ayan Kumar Bhunia
, Pinaki Nath Chowdhury, Aneeshan Sain,
Yi-Zhe Song
.
IEEE International Conference on Computer Vision (
ICCV
),
2021
Abstract
/
arXiv
/
BibTex
Visual text recognition is undoubtedly one of the most extensively researched topics in
computer vision. Great progress has been made to date, with the latest models starting to
focus
on the more practical ``in-the-wild'' setting. However, a salient problem still hinders
practical deployment -- prior arts mostly struggle with recognising unseen (or rarely seen)
character sequences. In this paper, we put forward a novel framework to specifically tackle
this
``unseen'' problem. Our framework is iterative in nature, in that it utilises predicted
knowledge of character sequences from a previous iteration, to augment the main network in
improving the next prediction. Key to our success is a unique cross-modal variational
autoencoder to act as a feedback module, which is trained with the presence of textual error
distribution data. This module importantly translates a discrete predicted character space to a
continuous affine transformation parameter space used to condition the visual feature map at
next iteration. Experiments on common datasets have shown competitive performance over
state-of-the-arts under the conventional setting. Most importantly, under the new disjoint
setup
where train-test labels are mutually exclusive, ours offers the best performance thus
showcasing
the capability of generalising onto unseen words.
@InProceedings{towardsunseen,
author = {Ayan Kumar Bhunia and Pinaki Nath Chowdhury and Aneeshan Sain and Yi-Zhe Song},
title = {Towards the Unseen: Iterative Text Recognition by Distilling from Errors},
booktitle = {The IEEE International Conference on Computer Vision (ICCV)},
month = {October},
year = {2021}
}
Joint Visual Semantic Reasoning: Multi-Stage Decoder for Text Recognition
Ayan Kumar Bhunia
, Aneeshan Sain,
Amandeep Kumar, Shuvozit Ghose, Pinaki
Nath
Chowdhury,
Yi-Zhe Song
.
IEEE International Conference on Computer Vision (
ICCV
),
2021
Abstract
/
arXiv
/
BibTex
Although text recognition has significantly evolved over the years, state-of-the-art (SOTA)
models still struggle in the wild scenarios due to complex backgrounds, varying fonts,
uncontrolled illuminations, distortions and other artifacts. This is because such models
solely
depend on visual information for text recognition, thus lacking semantic reasoning
capabilities.
In this paper, we argue that semantic information offers a complementary role in addition to the visual one. More specifically, we additionally utilize semantic information by proposing a
multi-stage multi-scale attentional decoder that performs joint visual-semantic reasoning.
Our
novelty lies in the intuition that for text recognition, prediction should be refined in a
stage-wise manner. Therefore our key contribution is in designing a stage-wise unrolling
attentional decoder where non-differentiability, invoked by discretely predicted character
labels, needs to be bypassed for end-to-end training. While the first stage predicts using
visual features, subsequent stages refine on-top of it using joint visual-semantic
information.
Additionally, we introduce multi-scale 2D attention along with dense and residual
connections
between different stages to deal with varying scales of character sizes, for better
performance
and faster convergence during training. Experimental results show our approach to outperform
existing SOTA methods by a considerable margin.
@InProceedings{JVSR,
author = {Ayan Kumar Bhunia and Aneeshan Sain and Amandeep Kumar and Shuvozit Ghose and Pinaki
Nath Chowdhury and Yi-Zhe Song},
title = {Joint Visual Semantic Reasoning: Multi-Stage Decoder for Text Recognition},
booktitle = {The IEEE International Conference on Computer Vision (ICCV)},
month = {October},
year = {2021}
}
Vectorization and Rasterization: Self-Supervised Learning for Sketch and
Handwriting
Ayan Kumar Bhunia
, Pinaki Nath Chowdhury, Yongxin Yang,
Timothy Hospedales, Tao Xiang,
Yi-Zhe Song
.
IEEE Conference on Computer Vision and Pattern Recognition (
CVPR
),
2021
Abstract
/
Code
/
arXiv
/
BibTex
Self-supervised learning has gained prominence due to its efficacy at learning powerful
representations from unlabelled data that achieve excellent performance on many challenging
downstream tasks. However, supervision-free pre-text tasks are challenging to design and usually
modality specific. Although there is a rich literature of self-supervised methods for either
spatial (such as images) or temporal data (sound or text) modalities, a common pre-text task
that benefits both modalities is largely missing. In this paper, we are interested in defining
a
self-supervised pre-text task for sketches and handwriting data. This data is uniquely
characterised by its existence in dual modalities of rasterized images and vector coordinate
sequences. We address and exploit this dual representation by proposing two novel cross-modal
translation pre-text tasks for self-supervised feature learning: Vectorization and
Rasterization. Vectorization learns to map image space to vector coordinates and rasterization
maps vector coordinates to image space. We show that our learned encoder modules benefit
both raster-based and vector-based downstream approaches to analysing hand-drawn data.
Empirical
evidence shows that our novel pre-text tasks surpass existing single and multi-modal
self-supervision methods.
@InProceedings{sketch2vec,
author = {Ayan Kumar Bhunia and Pinaki Nath Chowdhury and Yongxin Yang and Timothy Hospedales
and
Tao Xiang and Yi-Zhe Song},
title = {Vectorization and Rasterization: Self-Supervised Learning for Sketch and
Handwriting},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2021}
}
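Schematically, the two pre-text tasks translate between the raster and vector forms of the same drawing. A sketch of one training step is given below, with hypothetical encoder/decoder modules and deliberately simplified losses (the actual vector decoder is sequential):

import torch.nn.functional as F

def pretext_step(img_enc, vec_dec, seq_enc, img_dec, raster, points, optimizer):
    """raster: (B, 1, H, W) sketch images in [0, 1]; points: (B, T, 2) the same sketches as sequences."""
    pred_points = vec_dec(img_enc(raster))                 # vectorization: image -> coordinates
    loss_vec = F.mse_loss(pred_points, points)
    pred_raster = img_dec(seq_enc(points))                 # rasterization: coordinates -> image
    loss_ras = F.binary_cross_entropy_with_logits(pred_raster, raster)
    loss = loss_vec + loss_ras
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

# After pre-training, img_enc / seq_enc serve as feature extractors for raster- and
# vector-based downstream sketch and handwriting tasks.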
More Photos are All You Need: Semi-Supervised Learning for Fine-Grained Sketch
Based Image Retrieval
Ayan Kumar Bhunia
, Pinaki Nath Chowdhury, Aneeshan Sain,
Yongxin Yang, Tao Xiang,
Yi-Zhe Song
.
IEEE Conference on Computer Vision and Pattern Recognition (
CVPR
),
2021
Abstract
/
Code
/
arXiv
/
BibTex
A fundamental challenge faced by existing Fine-Grained Sketch-Based Image Retrieval
(FG-SBIR)
models is the data scarcity -- model performances are largely bottlenecked by the lack of
sketch-photo pairs. Whilst the number of photos can be easily scaled, each corresponding
sketch
still needs to be individually produced. In this paper, we aim to mitigate such an upper-bound
on sketch data, and study whether unlabelled photos alone (of which there are many) can be cultivated for performance gains. In particular, we introduce a novel semi-supervised
framework
for cross-modal retrieval that can additionally leverage large-scale unlabelled photos to
account for data scarcity. At the centre of our semi-supervision design is a sequential
photo-to-sketch generation model that aims to generate paired sketches for unlabelled photos.
Importantly, we further introduce a discriminator guided mechanism to guide against unfaithful
generation, together with a distillation loss based regularizer to provide tolerance against
noisy training samples. Last but not least, we treat generation and retrieval as two conjugate
problems, where a joint learning procedure is devised for each module to mutually benefit from
each other. Extensive experiments show that our semi-supervised model yields significant
performance boost over the state-of-the-art supervised alternatives, as well as existing
methods
that can exploit unlabelled photos for FG-SBIR.
@InProceedings{semi-fgsbir,
author = {Ayan Kumar Bhunia and Pinaki Nath Chowdhury and Aneeshan Sain and Yongxin Yang and Tao
Xiang and Yi-Zhe Song},
title = {More Photos are All You Need: Semi-Supervised Learning for Fine-Grained Sketch Based
Image Retrieval},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2021}
}
StyleMeUp: Towards Style-Agnostic Sketch-Based Image Retrieval
Aneeshan Sain,
Ayan Kumar Bhunia
,
Yongxin Yang,
Tao Xiang,
Yi-Zhe Song
.
IEEE Conference on Computer Vision and Pattern Recognition (
CVPR
),
2021
Abstract
/
arXiv
/
BibTex
Sketch-based image retrieval (SBIR) is a cross-modal matching problem which is typically
solved by learning a joint embedding space where the semantic content shared between the photo
and sketch modalities is preserved. However, a fundamental challenge in SBIR has been largely
ignored so far, that is, sketches are drawn by humans and considerable style variations exist
between different users. An effective SBIR model needs to explicitly account for this style
diversity, and crucially to generalise to unseen user styles. To this end, a novel
style-agnostic SBIR model is proposed. Different from existing models, a cross-modal
variational
autoencoder (VAE) is employed to explicitly disentangle each sketch into a semantic content
part
shared with the corresponding photo and a style part unique to the sketcher. Importantly, to
make our model dynamically adaptable to any unseen user styles, we propose to meta-train our
cross-modal VAE by adding two style-adaptive components: a set of feature transformation
layers
to its encoder and a regulariser to the disentangled semantic content latent code. With this
meta-learning framework, our model can not only disentangle the cross-modal shared semantic
content for SBIR, but can adapt the disentanglement to any unseen user styles as well, making
the SBIR model truly style-agnostic. Extensive experiments show that our style-agnostic model
yields state-of-the-art performance for both category-level and instance-level SBIR.
@InProceedings{stylemeup,
author = {Aneeshan Sain and Ayan Kumar Bhunia and Yongxin Yang and Tao Xiang and Yi-Zhe
Song},
title = {StyleMeUp: Towards Style-Agnostic Sketch-Based Image Retrieval},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2021}
}
MetaHTR: Towards Writer-Adaptive Handwritten Text Recognition
Ayan Kumar Bhunia
, Shuvozit Ghose, Amandeep Kumar,
Pinaki Nath Chowdhury, Aneeshan Sain,
Yi-Zhe Song
.
IEEE Conference on Computer Vision and Pattern Recognition (
CVPR
),
2021
Abstract
/
arXiv
/
BibTex
Handwritten Text Recognition (HTR) remains a challenging problem to date, largely due to the
varying writing styles that exist amongst us. Prior works however generally operate with the
assumption that there is a limited number of styles, most of which have already been captured
by
existing datasets. In this paper, we take a completely different perspective -- we work on the
assumption that there is always a new style that is drastically different, and that we will
only
have very limited data during testing to perform adaptation. This results in a commercially
viable solution -- the model has the best shot at adaptation being exposed to the new style,
and
the few samples nature makes it practical to implement. We achieve this via a novel
meta-learning framework which exploits additional new-writer data through a support set, and
outputs a writer-adapted model via single gradient step update, all during inference. We
discover and leverage the important insight that there exist a few key characters per writer
that exhibit relatively larger style discrepancies. For that, we additionally propose to
meta-learn instance specific weights for a character-wise cross-entropy loss, which is
specifically designed to work with the sequential nature of text data. Our writer-adaptive
MetaHTR framework can be easily implemented on top of most state-of-the-art HTR models.
Experiments show an average performance gain of 5-7% can be obtained by observing very few new
style data. We further demonstrate via a set of ablative studies the advantage of our meta
design when compared with alternative adaption mechanisms.
@InProceedings{metahtr,
author = {Ayan Kumar Bhunia and Shuvozit Ghose and Amandeep Kumar and Pinaki Nath Chowdhury and
Aneeshan Sain and Yi-Zhe Song},
title = {MetaHTR: Towards Writer-Adaptive Handwritten Text Recognition},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2021}
}
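The writer-adaptation step above boils down to a single gradient update on the support set, using a character-wise cross-entropy whose per-character weights come from a meta-learned module. A compressed sketch with hypothetical names; the real framework meta-learns the weighting network jointly with the recogniser.

import copy
import torch
import torch.nn.functional as F

def writer_adapt(htr_model, weight_net, support_imgs, support_labels, inner_lr=1e-3):
    """support_labels: (B, T) character indices; htr_model(imgs) -> (B, T, C) logits;
    weight_net maps (B, T, C) logits to (B, T) instance-specific loss weights (assumed)."""
    adapted = copy.deepcopy(htr_model)
    opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
    logits = adapted(support_imgs)
    ce = F.cross_entropy(logits.flatten(0, 1), support_labels.flatten(),
                         reduction="none").view_as(support_labels)   # character-wise CE, (B, T)
    w = weight_net(logits.detach())                                   # meta-learned weights, (B, T)
    loss = (w * ce).mean()
    opt.zero_grad(); loss.backward(); opt.step()                      # single gradient-step update
    return adapted                                                    # writer-adapted HTR model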
Pixelor: A Competitive Sketching AI Agent. So you think you can beat me?
Ayan Kumar Bhunia*
, Ayan Das*, Umar Riaz Muhammad*,
Yongxin Yang, Timothy M. Hospedales,
Tao Xiang, Yulia Gryaditskaya, Yi-Zhe
Song
.
SIGGRAPH Asia
, 2020.
Abstract
/
Code
/
arXiv
/
BibTex
/
Try Online Demo
(*equal contribution)
We present the first competitive drawing agent Pixelor that exhibits human-level
performance
at a Pictionary-like sketching game, where the participant whose sketch is recognized first is
a
winner. Our AI agent can autonomously sketch a given visual concept, and achieve a
recognizable
rendition as quickly or faster than a human competitor. The key to victory for the agent is to
learn the optimal stroke sequencing strategies that generate the most recognizable and
distinguishable strokes first. Training Pixelor is done in two steps. First, we infer the
optimal stroke order that maximizes early recognizability of human training sketches. Second,
this order is used to supervise the training of a sequence-to-sequence stroke generator. Our
key
technical contributions are a tractable search of the exponential space of orderings using
neural sorting; and an improved Seq2Seq Wasserstein (S2S-WAE) generator that uses an
optimal-transport loss to accommodate the multi-modal nature of the optimal stroke
distribution.
Our analysis shows that Pixelor is better than the human players of the Quick, Draw! game,
under
both AI and human judging of early recognition. To analyze the impact of human competitors’
strategies, we conducted a further human study with participants being given unlimited
thinking
time and training in early recognizability by feedback from an AI judge. The study shows that
humans do gradually improve their strategies with training, but overall Pixelor still matches
human performance. We will release the code and the dataset, optimized for the task of early
recognition, upon acceptance.
@InProceedings{sketchxpixelor,
author = {Ayan Kumar Bhunia and Ayan Das and Umar Riaz Muhammad and Yongxin Yang and Timothy M.
Hospedales and Tao Xiang and Yulia Gryaditskaya and Yi-Zhe Song},
title = {Pixelor: A Competitive Sketching AI Agent. So you think you can beat me?},
booktitle = {Siggraph Asia},
month = {November},
year = {2020}
}
Cross-Modal Hierarchical Modelling for Fine-Grained Sketch Based Image Retrieval
Aneeshan Sain,
Ayan Kumar Bhunia
, Yongxin Yang,
Tao Xiang, Yi-Zhe Song
.
British Machine Vision Conference (
BMVC
), 2020.
Abstract
/
arXiv
/
BibTex
(Oral Presentation)
Sketch as an image search query is an ideal alternative to text in capturing the
fine-grained
visual details. Prior successes on fine-grained sketch-based image retrieval (FG-SBIR) have
demonstrated the importance of tackling the unique traits of sketches as opposed to photos,
e.g., temporal vs. static, strokes vs. pixels, and abstract vs. pixel-perfect. In this paper,
we
study a further trait of sketches that has been overlooked to date, that is, they are
hierarchical in terms of the levels of detail -- a person typically sketches up to various
extents of detail to depict an object. This hierarchical structure is often visually distinct.
In this paper, we design a novel network that is capable of cultivating sketch-specific
hierarchies and exploiting them to match sketch with photo at corresponding hierarchical
levels.
In particular, features from a sketch and a photo are enriched using cross-modal co-attention,
coupled with hierarchical node fusion at every level to form a better embedding space to
conduct
retrieval. Experiments on common benchmarks show our method to outperform state-of-the-arts by
a
significant margin.
@InProceedings{sain2020crossmodal,
author = {Aneeshan Sain and Ayan Kumar Bhunia and Yongxin Yang and Tao Xiang and Yi-Zhe
Song},
title = {Cross-Modal Hierarchical Modelling for Fine-Grained Sketch Based Image Retrieval},
booktitle = {BMVC},
month = {September},
year = {2020}
}
Fine-grained visual classification via progressive multi-granularity training of
jigsaw patches
Ruoyi Du, Dongliang Chang,
Ayan Kumar Bhunia
, Jiyang Xie,
Zhanyu Ma, Yi-Zhe Song
, Jun Guo
.
European Conference on Computer Vision (
ECCV
), 2020.
Abstract
/
Code
/
arXiv
/
BibTex
Fine-grained visual classification (FGVC) is much more challenging than traditional
classification tasks due to the inherently subtle intra-class object variations. Recent works
are
mainly part-driven (either explicitly or implicitly), with the assumption that fine-grained
information naturally rests within the parts. In this paper, we take a different stance, and
show that part operations are not strictly necessary -- the key lies with encouraging the
network to learn at different granularities and progressively fusing multi-granularity
features
together. In particular, we propose: (i) a progressive training strategy that effectively
fuses
features from different granularities, and (ii) a random jigsaw patch generator that
encourages
the network to learn features at specific granularities. We evaluate on several standard FGVC
benchmark datasets, and show the proposed method consistently outperforms existing
alternatives
or delivers competitive results.
@InProceedings{du2020fine,
author = {Du, Ruoyi and Chang, Dongliang and Bhunia, Ayan Kumar and Xie, Jiyang and Song, Yi-Zhe
and Ma, Zhanyu and Guo, Jun},
title = {Fine-grained visual classification via progressive multi-granularity training of jigsaw
patches},
booktitle = {ECCV},
month = {August},
year = {2020}
}
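The random jigsaw patch generator is the most self-contained piece of the method above: split each image into an n x n grid and randomly permute the patches, with n set per training stage to control granularity. A plain PyTorch sketch:

import torch

def jigsaw_shuffle(images, n):
    """images: (B, C, H, W) with H and W divisible by n; returns images whose n*n patches
    are randomly permuted, so only structure within a patch survives."""
    b, c, h, w = images.shape
    ph, pw = h // n, w // n
    patches = images.unfold(2, ph, ph).unfold(3, pw, pw)       # (B, C, n, n, ph, pw)
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, n * n, c, ph, pw)
    perm = torch.randperm(n * n, device=images.device)
    patches = patches[:, perm]                                  # shuffle the patch order
    patches = patches.reshape(b, n, n, c, ph, pw).permute(0, 3, 1, 4, 2, 5)
    return patches.reshape(b, c, h, w)                          # stitched jigsaw image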
Sketch Less for More: On-the-Fly Fine-Grained Sketch Based Image Retrieval
Ayan Kumar Bhunia
, Yongxin Yang, Timothy M. Hospedales,
Tao Xiang, Yi-Zhe Song.
IEEE Conference on Computer Vision and Pattern Recognition (
CVPR
), 2020.
Abstract
/
Code
/
arXiv
/
BibTex
(Oral Presentation)
Fine-grained sketch-based image retrieval (FG-SBIR) addresses the problem of retrieving a
particular photo instance given a user's query sketch. Its widespread applicability is however
hindered by the fact that drawing a sketch takes time, and most people struggle to draw a
complete and faithful sketch. In this paper, we reformulate the conventional FG-SBIR framework
to tackle these challenges, with the ultimate goal of retrieving the target photo with the
least
number of strokes possible. We further propose an on-the-fly design that starts retrieving as
soon as the user starts drawing. To accomplish this, we devise a reinforcement learning based
cross-modal retrieval framework that directly optimizes rank of the ground-truth photo over a
complete sketch drawing episode. Additionally, we introduce a novel reward scheme that
circumvents the problems related to irrelevant sketch strokes, and thus provides us with a
more
consistent rank list during the retrieval. We achieve superior early-retrieval efficiency over
state-of-the-art methods and alternative baselines on two publicly available fine-grained
sketch
retrieval datasets.
@InProceedings{bhunia2020sketch,
author = {Ayan Kumar Bhunia and Yongxin Yang and Timothy M. Hospedales and Tao Xiang and Yi-Zhe
Song},
title = {Sketch Less for More: On-the-Fly Fine-Grained Sketch Based Image Retrieval},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2020}
}
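The on-the-fly objective above can be pictured as a reward computed after every new stroke, based on where the true photo ranks in the retrieval list at that moment. A hedged helper for such per-step rank rewards is sketched below (hypothetical interfaces, assuming a trained joint embedding model); the actual method optimises the embedding with RL using rewards of this kind.

import torch
import torch.nn.functional as F

def episode_rank_rewards(encoder, partial_sketches, true_photo, gallery):
    """partial_sketches: progressively completed renderings of one sketch, each (1, C, H, W);
    true_photo: (1, C, H, W); gallery: (N, C, H, W). Returns one reward per drawing step."""
    rewards = []
    with torch.no_grad():
        g = F.normalize(encoder(gallery), dim=-1)            # (N, D) photo embeddings
        t = F.normalize(encoder(true_photo), dim=-1)         # (1, D)
        for sk in partial_sketches:
            q = F.normalize(encoder(sk), dim=-1)             # embedding of the partial sketch
            sims = q @ g.t()                                  # similarity to every gallery photo
            true_sim = (q @ t.t()).item()
            rank = int((sims > true_sim).sum().item()) + 1    # 1 means retrieved first
            rewards.append(1.0 / rank)                        # early, high ranks earn more reward
    return rewards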
Handwriting Recognition in Low-Resource Scripts Using Adversarial Learning
Ayan Kumar Bhunia
, Abhirup Das, Ankan Kumar Bhunia,
Perla Sai Raj Kishore, Partha Pratim Roy.
IEEE Conference on Computer Vision and Pattern Recognition (
CVPR
), 2019
Abstract
/
Code
/
arXiv
/
BibTex
Handwritten Word Recognition and Spotting is a challenging field dealing with handwritten text
possessing irregular and complex shapes. The design of deep neural network models makes it
necessary to extend training datasets in order to introduce variations and increase the number
of samples; word-retrieval is therefore very difficult in low-resource scripts. Much of the
existing literature comprises preprocessing strategies which are seldom sufficient to cover
all
possible variations. We propose an Adversarial Feature Deformation Module (AFDM) that learns
ways to elastically warp extracted features in a scalable manner. The AFDM is inserted between
intermediate layers and trained alternatively with the original framework, boosting its
capability to better learn highly informative features rather than trivial ones. We test our
meta-framework, which is built on top of popular word-spotting and word-recognition frameworks
and enhanced by AFDM, not only on extensive Latin word datasets but also on sparser Indic
scripts. We record results for varying sizes of training data, and observe that our enhanced
network generalizes much better in the low-data regime; the overall word-error rates and mAP
scores are observed to improve as well.
@InProceedings{Bhunia_2019_CVPR,
author = {Bhunia, Ayan Kumar and Das, Abhirup and Bhunia, Ankan Kumar and Kishore, Perla Sai Raj
and Roy, Partha Pratim},
title = {Handwriting Recognition in Low-Resource Scripts Using Adversarial Learning},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2019}
}
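The Adversarial Feature Deformation Module can be pictured as a small network that warps intermediate feature maps and is trained to maximise the recogniser's loss while the recogniser minimises it. The sketch below uses a simple affine warp as a stand-in for the paper's deformation and is purely illustrative:

import torch
import torch.nn as nn
import torch.nn.functional as F

class AffineFeatureDeformer(nn.Module):
    """Predicts a per-sample affine warp and applies it to intermediate feature maps."""
    def __init__(self, channels):
        super().__init__()
        self.theta_net = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, 6))                   # 2x3 affine parameters per sample

    def forward(self, feats):                         # feats: (B, C, H, W)
        theta = self.theta_net(feats).view(-1, 2, 3)
        grid = F.affine_grid(theta, feats.size(), align_corners=False)
        return F.grid_sample(feats, grid, align_corners=False)

# Alternating training (illustrative): the deformer is updated to increase the word-recognition
# loss on warped features, while the recogniser is updated to decrease it, pushing the recogniser
# towards features that survive elastic-style deformations.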
Improving Document Binarization via Adversarial Noise-Texture Augmentation
Ankan Kumar Bhunia,
Ayan Kumar Bhunia
,
Aneeshan Sain, Partha Pratim Roy.
IEEE Conference on Image Processing (
ICIP
), 2019
Abstract
/
Code
/
arXiv
/
BibTex
(Top 10% Papers)
Binarization of degraded document images is an elementary step in most of the problems in
the document image analysis domain. This paper revisits the binarization problem by introducing an
adversarial learning approach. We construct a Texture Augmentation Network that transfers the
texture element of a degraded reference document image to a clean binary image. In this way,
the
network creates multiple versions of the same textual content with various noisy textures,
thus
enlarging the available document binarization datasets. Finally, the newly generated images
are
passed through a Binarization network to get back the clean version. By jointly training the
two
networks we can increase the adversarial robustness of our system. Also, it is noteworthy that
our model can learn from unpaired data. Experimental results suggest that the proposed method
achieves superior performance over widely used DIBCO datasets.
@inproceedings{bhunia2019improving,
title={Improving document binarization via adversarial noise-texture augmentation},
author={Bhunia, Ankan Kumar and Bhunia, Ayan Kumar and Sain, Aneeshan and Roy, Partha Pratim},
booktitle={2019 IEEE International Conference on Image Processing (ICIP)},
pages={2721--2725},
year={2019},
organization={IEEE}
}
A Deep One-Shot Network for Query-based Logo Retrieval
Ayan Kumar Bhunia
, Ankan Kumar Bhunia,
Shuvozit Ghose, Abhirup Das, Partha
Pratim
Roy, Umapada Pal
Pattern Recognition (
PR
), 2019
Abstract
/
Code
/
Third Party Implementation
/
arXiv
/
BibTex
Logo detection in real-world scene images is an important problem with applications in
advertisement and marketing. Existing general-purpose object detection methods require large
training data with annotations for every logo class. These methods do not satisfy the
incremental demand of logo classes necessary for practical deployment since it is practically
impossible to have such annotated data for new unseen logos. In this work, we develop an
easy-to-implement query-based logo detection and localization system by employing a one-shot
learning technique using off-the-shelf neural network components. Given an image of a query logo, our model searches for the logo within a given target image and predicts the possible
location
of the logo by estimating a binary segmentation mask. The proposed model consists of a
conditional branch and a segmentation branch. The former gives a conditional latent
representation of the given query logo which is combined with feature maps of the segmentation
branch at multiple scales in order to obtain the matching location of the query logo in a
target
image. Feature matching between the latent query representation and multi-scale feature maps
of
segmentation branch using simple concatenation operation followed by 1 × 1 convolution layer
makes our model scale-invariant. Despite its simplicity, our query-based logo retrieval
framework achieved superior performance on the FlickrLogos-32 and TopLogos-10 datasets over
different
existing baseline methods.
@article{bhunia2019deep,
title={A Deep One-Shot Network for Query-based Logo Retrieval},
author={Bhunia, Ayan Kumar and Bhunia, Ankan Kumar and Ghose, Shuvozit and Das, Abhirup and Roy,
Partha Pratim and Pal, Umapada},
journal={Pattern Recognition},
pages={106965},
year={2019},
publisher={Elsevier}}
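The conditioning mechanism described above (tile the query-logo latent vector over the spatial grid, concatenate it with a segmentation-branch feature map, and fuse with a 1 x 1 convolution) is easy to sketch; the shapes and module name below are hypothetical.

import torch
import torch.nn as nn

class QueryConditionedFusion(nn.Module):
    """Fuses the query-logo latent representation with one scale of the segmentation branch."""
    def __init__(self, feat_channels, query_dim):
        super().__init__()
        self.fuse = nn.Conv2d(feat_channels + query_dim, feat_channels, kernel_size=1)

    def forward(self, feat_map, query_latent):
        # feat_map: (B, C, H, W); query_latent: (B, Q) from the conditional branch
        b, _, h, w = feat_map.shape
        q = query_latent.view(b, -1, 1, 1).expand(-1, -1, h, w)   # tile over the spatial grid
        return self.fuse(torch.cat([feat_map, q], dim=1))          # 1x1 conv after concatenation

# Applying this at multiple scales before the mask head is what gives the matching its
# scale invariance in the description above.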
User Constrained Thumbnail Generation Using Adaptive Convolutions
Perla Sai Raj Kishore,
Ayan Kumar Bhunia
, Shuvozit Ghose,
Partha Pratim Roy
International Conference on Acoustics, Speech and Signal Processing (
ICASSP
), 2019
Abstract
/
Code
/
arXiv
/
BibTex
(Oral Presentation)
Thumbnails are widely used all over the world as a preview for digital images. In this work we
propose a deep neural framework to generate thumbnails of any size and aspect ratio, even for
unseen values during training, with high accuracy and precision. We use Global Context
Aggregation (GCA) and a modified Region Proposal Network (RPN) with adaptive convolutions to
generate thumbnails in real time. GCA is used to selectively attend and aggregate the global
context information from the entire image while the RPN is used to generate candidate bounding
boxes for the thumbnail image. Adaptive convolution eliminates the difficulty of generating
thumbnails of various aspect ratios by using filter weights dynamically generated from the
aspect ratio information. The experimental results indicate the superior performance of the
proposed model over existing state-of-the-art techniques.
@inproceedings{kishore2019user,
title={User Constrained Thumbnail Generation Using Adaptive Convolutions},
author={Kishore, Perla Sai Raj and Bhunia, Ayan Kumar and Ghose, Shuvozit and Roy, Partha
Pratim},
booktitle={ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP)},
pages={1677--1681},
year={2019},
organization={IEEE}
}
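The adaptive-convolution idea above, where filter weights are generated on the fly from the requested aspect ratio, can be sketched with a tiny hyper-network feeding F.conv2d. The sizes and names are assumptions for illustration:

import torch
import torch.nn as nn
import torch.nn.functional as F

class AspectAdaptiveConv(nn.Module):
    """Generates k x k convolution filters from the target thumbnail aspect ratio."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.in_ch, self.out_ch, self.k = in_ch, out_ch, k
        self.filter_gen = nn.Sequential(              # hyper-network: aspect ratio -> kernel weights
            nn.Linear(1, 64), nn.ReLU(),
            nn.Linear(64, out_ch * in_ch * k * k))

    def forward(self, x, aspect_ratio):
        # x: (B, in_ch, H, W); aspect_ratio: a Python float such as 16 / 9 (shared by the batch here)
        ratio = torch.tensor([[float(aspect_ratio)]], dtype=x.dtype, device=x.device)
        w = self.filter_gen(ratio).view(self.out_ch, self.in_ch, self.k, self.k)
        return F.conv2d(x, w, padding=self.k // 2)    # convolution with dynamically generated filters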
Texture Synthesis Guided Deep Hashing for Texture Image Retrieval
Ayan Kumar Bhunia
,
Perla Sai Raj Kishore,
Pranay Mukherjee,
Abhirup Das,
Partha Pratim Roy
IEEE Winter Conference on Applications of Computer Vision (
WACV
), 2019
Abstract
/
arXiv
/
BibTex
/
Video Presentation
With the large scale explosion of images and videos over the internet, efficient hashing
methods
have been developed to facilitate memory and time efficient retrieval of similar images.
However, none of the existing works use hashing to address texture image retrieval mostly
because of the lack of sufficiently large texture image databases. Our work addresses this
problem by developing a novel deep learning architecture that generates binary hash codes for
input texture images. For this, we first pre-train a Texture Synthesis Network (TSN) which
takes
a texture patch as input and outputs an enlarged view of the texture by injecting newer
texture
content. Thus it signifies that the TSN encodes the learnt texture specific information in its
intermediate layers. In the next stage, a second network gathers the multi-scale feature
representations from the TSN’s intermediate layers using channel-wise attention, combines them
in a progressive manner to a dense continuous representation which is finally converted into a
binary hash code with the help of individual and pairwise label information. The new enlarged
texture patches from the TSN also help in data augmentation to alleviate the problem of
insufficient texture data and are used to train the second stage of the network. Experiments
on
three public texture image retrieval datasets indicate the superiority of our texture
synthesis
guided hashing approach over existing state-of-the-art methods.
@inproceedings{bhunia2019texture,
title={Texture synthesis guided deep hashing for texture image retrieval},
author={Bhunia, Ayan Kumar and Perla, Sai Raj Kishore and Mukherjee, Pranay and Das, Abhirup and
Roy, Partha Pratim},
booktitle={2019 IEEE Winter Conference on Applications of Computer Vision (WACV)},
pages={609--618},
year={2019},
organization={IEEE}
}
Script identification in natural scene image and video frames using an attention
based Convolutional-LSTM network
Ankan Kumar Bhunia, Aishik Konwer,
Ayan Kumar Bhunia
,
Abir Bhowmick, Partha Pratim Roy, Umapada
Pal
Pattern Recognition (
PR
), 2019
Abstract
/
Code
/
arXiv
/
BibTex
Script identification plays a significant role in analysing documents and videos. In this
paper,
we focus on the problem of script identification in scene text images and video scripts.
Because
of low image quality, complex background and similar layout of characters shared by some
scripts
like Greek, Latin, etc., text recognition in those cases becomes challenging. In this paper, we
propose a novel method that involves extraction of local and global features using CNN-LSTM
framework and weighting them dynamically for script identification. First, we convert the
images
into patches and feed them into a CNN-LSTM framework. Attention-based patch weights are
calculated applying softmax layer after LSTM. Next, we do patch-wise multiplication of these
weights with the corresponding CNN features to yield local features. Global features are also extracted
from
last cell state of LSTM. We employ a fusion technique which dynamically weights the local and
global features for an individual patch. Experiments have been done on four public script
identification datasets: SIW-13, CVSI2015, ICDAR-17 and MLe2e. The proposed framework achieves
superior results in comparison to conventional methods.
@article{bhunia2019script,
title={Script identification in natural scene image and video frames using an attention based
convolutional-LSTM network},
author={Bhunia, Ankan Kumar and Konwer, Aishik and Bhunia, Ayan Kumar and Bhowmick, Abir and
Roy,
Partha P and Pal, Umapada},
journal={Pattern Recognition},
volume={85},
pages={172--184},
year={2019},
publisher={Elsevier}
}
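The dynamic local/global weighting described above reduces to a softmax attention over per-patch recurrent outputs, combined with a global feature from the final LSTM state. A condensed, illustrative re-implementation with hypothetical dimensions:

import torch
import torch.nn as nn

class PatchAttentionFusion(nn.Module):
    """Attention over per-patch CNN features plus a global LSTM feature, then classification."""
    def __init__(self, feat_dim=256, hidden=128, num_scripts=13):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.score = nn.Linear(hidden, 1)
        self.classifier = nn.Linear(feat_dim + hidden, num_scripts)

    def forward(self, patch_feats):                    # (B, num_patches, feat_dim) from a CNN
        out, (h, c) = self.lstm(patch_feats)           # out: (B, P, hidden)
        attn = torch.softmax(self.score(out), dim=1)   # patch-wise attention weights
        local = (attn * patch_feats).sum(dim=1)        # attention-weighted local features
        global_feat = c[-1]                            # global feature from the last cell state
        return self.classifier(torch.cat([local, global_feat], dim=-1))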