Sourajit Saha
he/him/his
University of Maryland, Baltimore County
Computer Science and Electrical Engineering
Biography
I am a Ph.D. student in Computer Science at the University of Maryland, Baltimore County (UMBC), where I work under the guidance of Dr. Tejas Gokhale in the Cognitive Vision Group. I received my Master of Science in Computer Science from UMBC and my Bachelor of Science from BRAC University in Bangladesh. My research focuses on advancing the capabilities of computer vision systems by addressing challenges in interactive video retrieval, visual reasoning, and video understanding. In the area of interactive video retrieval, search, and understanding, I explore methods that integrate vision-language models (VLMs), scene-graph reasoning, and dialogue-driven interaction to lessen the burden of human annotation. My goal is to make video retrieval more interactive and semantically aligned with human intent. My work on visual reasoning investigates spatial understanding, counterfactual inference, and visual editing, with the broader aim of improving the interpretability and adaptability of AI models. This line of research seeks to push models beyond pattern recognition toward deeper semantic understanding. At the Summer Camp for Applied Language Exploration (SCALE) 2024 at Johns Hopkins University, I also worked on semantic video frame sampling and caption-based video event localization to enhance event retrieval in multilingual videos.
Academic Status
PhD Student - 5th Year
Research Area/Department
Computer Science; Machine Learning/AI; other
Major/Specialty
Computer Science (Computer Vision, Machine Learning)
Degrees Earned or in Progress
2021–Ongoing: Ph.D., Computer Science, University of Maryland, Baltimore County. Advisor: Tejas Gokhale.
2021–2023: Master of Science, Computer Science, University of Maryland, Baltimore County. Academic Supervisors: Tim Oates, David Chapman.
2013–2017: Bachelor of Science, Computer Science, BRAC University. Advisor: Suraiya Tairin.
Academic Preparation
Computer Vision, Image Processing, Data Visualization, Natural Language Processing, Machine Learning, Pattern Recognition, Artificial Intelligence, Advanced Artificial Intelligence, Optimization Algorithms, Design and Analysis of Algorithms, Advanced Computer Architecture
Research/Publications
Saha, Shaswati, Sourajit Saha, Manas Gaur, and Tejas Gokhale. "Side Effects of Erasing Concepts from Diffusion Models." arXiv preprint arXiv:2508.15124 (2025); EMNLP 2025 Findings.
Saha, Sourajit, and Tejas Gokhale. "Improving Shift Invariance in Convolutional Neural Networks with Translation Invariant Polyphase Sampling." In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 620-629. IEEE, 2025.
Saha, Sourajit, Shaswati Saha, Md Osman Gani, Tim Oates, and David Chapman. "RFC-Net: Learning High Resolution Global Features for Medical Image Segmentation on a Computational Budget (Student Abstract)." In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, no. 13, pp. 16314-16315. 2023.
Saha, Sourajit, and Yaacov Yesha. "Pairwise Meta Learning Pipeline: Classifying COVID-19 Abnormalities on Chest Radiographs." In Medical Imaging 2022: Computer-Aided Diagnosis, Proceedings Volume PC12033, PC1203302. SPIE, 2022.
Kamran, Sharif Amit, Sourajit Saha, Ali Shihab Sabbir, and Alireza Tavakkoli. "Optic-Net: A Novel Convolutional Neural Network for Diagnosis of Retinal Diseases from Optical Coherence Tomography Images." In 2019 18th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 964-971. IEEE, 2019.
Research/Academic Interests
My research lies at the intersection of computer vision, multimodal reasoning, and video understanding, with a focus on developing interactive and human-centered video retrieval systems. As a Ph.D. student at UMBC in the Cognitive Vision Group under Dr. Tejas Gokhale, I explore methods that combine vision-language models, scene-graph reasoning, and dialogue-driven interaction to reduce annotation burden and make retrieval more semantically aligned with human intent. In interactive video retrieval, I am interested in designing systems that allow users to refine search through natural, context-aware interaction. In visual reasoning, my work investigates spatial understanding, counterfactual inference, and visual editing, with the broader goal of pushing models beyond pattern recognition toward deeper semantic understanding. These directions aim to improve both the interpretability and adaptability of AI systems. At Summer Camp for Applied Language Exploration (SCALE) 2024, I contributed to semantic video frame sampling and caption-based event localization to enhance event retrieval in multilingual videos. This experience sharpened my focus on bridging retrieval and reasoning for diverse, real-world applications. Moving forward, my dissertation will expand on these foundations, with an emphasis on interactive, scalable, and explainable systems. Ultimately, I aim to advance multimodal AI that is interpretable, inclusive, and impactful across disciplines.
Computational and Data Science Areas
Applied Computer Science; Artificial Intelligence and Intelligent Systems; Computer Science; Electrical, Electronic, and Information Engineering; Informatics, Analytics and Information Science; Visualization and Human-Computer Systems
Motivation
I am eager to join the Sustainable Research Pathways program because it directly supports my Ph.D. research on interactive video retrieval, multimodal reasoning, and video understanding at UMBC under Dr. Tejas Gokhale. My work integrates vision-language models, scene-graph reasoning, and dialogue-driven interaction to make retrieval systems more semantically aligned with human intent while reducing annotation burdens. At the Summer Camp for Applied Language Exploration (SCALE) 2024, hosted by Johns Hopkins University, I contributed to semantic video frame sampling and caption-based event localization for multilingual retrieval. This experience broadened my perspective on large-scale collaborative research and highlighted the importance of advancing video retrieval and reasoning to support diverse, real-world applications. Building on this foundation through SCALE 2026 will allow me to collaborate with experts in video reasoning and summarization, multimodal event detection and localization, and multimodal retrieval, areas central to my dissertation. The Sustainable Horizons Institute's mission resonates with me as a Bangladeshi scholar pursuing research in the U.S. I deeply value inclusive scientific communities where diverse voices thrive. Through this program, I hope to contribute my expertise, grow within a supportive research network, and advance scalable, human-centered AI systems that benefit the broader scientific ecosystem.
Lightning Talk Title
Bridging Vision and Language: Towards Interactive Multi-Modal Search and Reasoning
Keywords (Maximum 20 words)
Interactive Visual Search; Multi-modal Retrieval; Multi-modal Reasoning; Video Understanding; Computer Vision