SHI Collaboration Profiles

Profile pages for Sustainable Horizons Institute SRP 2025-2026 Project Leaders


Reno Kriz

Johns Hopkins University

Human Language Technology Center of Excellence

Biography

Reno Kriz is a research scientist at the Johns Hopkins University Human Language Technology Center of Excellence (HLTCOE). His primary research interests involve leveraging large pre-trained models for a variety of natural language understanding tasks, including those crossing into other modalities, e.g., vision and speech understanding. These multimodal interests have recently involved the 2024 and 2026 Summer Camps for Applied Language Exploration (SCALE) on event-centric video retrieval, understanding, and summarization. He received his PhD from the University of Pennsylvania, where he worked with Chris Callison-Burch and Marianna Apidianaki on text simplification and natural language generation. Prior to that, he received BA degrees in Computer Science, Mathematics, and Economics from Vassar College.

SRP Project Title

SCALE 2026: Event Understanding and Summarization from Real-time Videos

NAIRR Project

Advancing Scientific Discovery through Multilingual/Multimodal Summarization at SCALE 2025/2026

Topical Areas

Artificial Intelligence and Intelligent Systems; Computer Science; Electrical, Electronic, and Information Engineering; Informatics, Analytics and Information Science

Abstract

The ability to understand real-time, multilingual video content is increasingly important. From smartphone footage of natural disasters to public livestreams near high-risk infrastructure, these unedited clips offer firsthand evidence of unfolding events. Combined with audio and embedded text, they form a rich multimodal source that remains underutilized in current retrieval-augmented generation systems. Especially in real-time situations, scientific advances in grounding articles in video can combat misinformation and help journalists quickly synthesize information from non-traditional, cross-lingual platforms. SCALE 2026, a 10-week workshop hosted by the Human Language Technology Center of Excellence (HLTCOE) at Johns Hopkins University, provides a realistic setting for advancing real-world multimodal understanding. During the summer, we will evaluate modality-specific technologies for extracting relevant signals from raw video data. First-stage research areas include audio and visual event detection, speech and audio summarization, and OCR and visual frame analysis. These signals will inform the second stage, our primary task of multimodal retrieval-augmented generation: given an information need and a collection of raw multilingual videos, the system must retrieve relevant content and generate a coherent summary of the most significant information. Second-stage research areas include multimodal information retrieval and multi-video summarization.
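The two-stage structure described in the abstract can be sketched in code. The following is a minimal illustrative Python sketch, not an implementation of any SCALE system: all function names, data fields, and the toy lexical relevance score are hypothetical placeholders standing in for the actual modality-specific models (event detection, speech recognition, OCR) and multimodal retrieval-augmented generation components.

```python
# Illustrative two-stage pipeline sketch; every name here is a placeholder,
# not an actual SCALE 2026 component.
from dataclasses import dataclass, field

@dataclass
class VideoSignals:
    """First-stage, modality-specific signals extracted from one raw video."""
    video_id: str
    events: list = field(default_factory=list)  # audio/visual event detections
    transcript: str = ""                        # speech recognition / summarization output
    ocr_text: str = ""                          # text read from video frames (OCR)

def extract_signals(video_id: str, raw: dict) -> VideoSignals:
    # Stage 1 placeholder: real systems would run event detection, ASR,
    # and OCR models over the raw video here.
    return VideoSignals(video_id=video_id,
                        events=raw.get("events", []),
                        transcript=raw.get("speech", ""),
                        ocr_text=raw.get("frame_text", ""))

def retrieve(query: str, collection: list, k: int = 2) -> list:
    # Stage 2a placeholder: a toy lexical overlap score stands in for
    # multimodal information retrieval.
    def score(v: VideoSignals) -> int:
        text = " ".join([v.transcript, v.ocr_text] + v.events).lower()
        return sum(term in text for term in query.lower().split())
    return sorted(collection, key=score, reverse=True)[:k]

def summarize(query: str, videos: list) -> str:
    # Stage 2b placeholder: multi-video retrieval-augmented generation
    # would produce a coherent summary grounded in the retrieved videos.
    sources = ", ".join(v.video_id for v in videos)
    return f"Summary for '{query}' drawn from: {sources}"

# Toy usage: an information need posed against a small video collection.
collection = [
    extract_signals("vid1", {"events": ["flooding"],
                             "speech": "river levels rising fast"}),
    extract_signals("vid2", {"events": ["concert"],
                             "speech": "crowd singing along"}),
]
print(summarize("rising flood waters",
                retrieve("rising flood waters", collection, k=1)))
```

The point of the sketch is the data flow: per-modality signals are extracted once per video, then both retrieval and summarization operate over those signals rather than over the raw footage.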

Desired Skills

We welcome participants with backgrounds or interests in natural language processing, computer vision, and multimodal technologies. Relevant experience might include work with large language or vision-language models, video or audio understanding, and multilingual or low-resource technologies. Participants interested in areas such as event detection, speech processing, optical character recognition, information retrieval, or summarization are especially encouraged to apply. SCALE projects are highly collaborative, bringing together researchers from diverse areas of expertise, so openness to interdisciplinary teamwork and shared problem-solving is essential.

Additional Comments

SCALE (the Summer Camp for Applied Language Exploration) is an annual 10-week research program hosted by the HLTCOE at Johns Hopkins University since 2009. For more on its history and past topics, see https://hltcoe.jhu.edu/research/scale/. SCALE 2026 builds on the two most recent workshops: SCALE 2024, focused on event-centric video retrieval, and SCALE 2025, which explored retrieval-augmented generation for request-guided summarization of multilingual sources.

Lightning Talk Title

SCALE 2026: Event Understanding and Summarization from Real-time Videos

Keywords

multimodal retrieval-augmented generation; multi-video summarization; video retrieval; speech/audio summarization; optical character recognition; audio/visual event detection; computer vision; speech processing