LongShOT: A Benchmark & Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos
Mohammed Irfan Kurpath*, Jaseel Muhammad Kaithakkodan*, Jinxing Zhou, Sahal Shaji Mullappilly, Mohammad Almansoori, Noor Ahsan, Beknur Kalmakhanbet, Sambal Shikhar, Rishabh Lalla, Jean Lahoud, Mariette Awad, Fahad Shahbaz Khan, Salman Khan, Rao Muhammad Anwer, Hisham Cholakkal
* Equal Contribution
Abstract
Bridging the gap in long-form video understanding
Q&A Samples
Open-ended, intent-driven
Task Categories
Perception to reasoning
Avg. Duration
Long-form video
Human Verified
Manually reviewed
Long-form multimodal video understanding requires integrating vision, speech, and ambient audio with coherent long-range reasoning. Existing benchmarks emphasize either temporal length or multimodal richness, but rarely both. We introduce LongShOTBench, a diagnostic benchmark with open-ended, intent-driven questions; single- and multi-turn dialogues; and tasks requiring multimodal reasoning and agentic tool use across video, audio, and speech. Each item includes a reference answer and graded rubric for interpretable and traceable evaluation.
Furthermore, we present LongShOTAgent, an agentic system that analyzes long videos via preprocessing, search, and iterative refinement. On LongShOTBench, state-of-the-art MLLMs show large gaps: Gemini-2.5-Flash achieves 52.95%, open-source models remain below 30%, and LongShOTAgent attains 44.66%.
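The preprocess, search, and iterative-refinement loop described above can be illustrated with a toy sketch. Everything in it is hypothetical: the `Segment` type, the word-overlap `search`, and the `answer_with_refinement` loop are stand-ins chosen for illustration, not LongShOTAgent's actual components, which this page does not specify.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start_s: float  # segment start time in seconds
    end_s: float    # segment end time in seconds
    text: str       # hypothetical fused description (captions, speech, audio tags)


def preprocess(transcript):
    """Turn raw (start, end, text) triples into a searchable list of segments."""
    return [Segment(start, end, text.lower()) for start, end, text in transcript]


def search(index, query, top_k=2):
    """Rank segments by word overlap with the query (stand-in for a real retriever)."""
    words = set(query.lower().split())
    scored = [(len(words & set(seg.text.split())), seg) for seg in index]
    scored = [(score, seg) for score, seg in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable: ties keep video order
    return [seg for _, seg in scored[:top_k]]


def answer_with_refinement(index, question, follow_ups, max_rounds=3):
    """Search for evidence, widening the query with follow-up terms while nothing is found."""
    query = question
    evidence = []
    for round_no in range(max_rounds):
        evidence = search(index, query)
        if evidence or round_no >= len(follow_ups):
            break
        query = f"{query} {follow_ups[round_no]}"  # refine the query and retry
    return evidence


# Assumed example data, loosely modelled on a cooking tutorial.
transcript = [
    (0, 60, "Intro and ingredients list"),
    (60, 180, "Chop the onions and garlic"),
    (180, 300, "Simmer the sauce for twenty minutes"),
]
index = preprocess(transcript)
evidence = answer_with_refinement(index, "how long to cook?", ["simmer"])
```

In this sketch the first query finds nothing, so the loop appends the follow-up term and retries. A real agent would presumably use a model to propose refinements and a multimodal retriever rather than word overlap.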
Pipeline
How LongShOTBench is built
From raw long-form video to a diagnostic, human-verified benchmark — a five-stage pipeline with multimodal extraction, intent-driven Q&A generation, and graded evaluation rubrics.

Raw Video Content — Cooking Tutorial — 1h 20 min
Multimodal Signal Extraction
Cross-Modal Alignment & Fusion
Intent-Driven Question Design
Rubric & Evaluation Framework
Human Validation & Correction
Figure 1: The pipeline starts from raw video, extracts multimodal signals, aligns and fuses them across modalities, generates intent-driven Q&A, creates verifiable rubrics, and ends with human validation.
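The stage ordering in the figure can be sketched as a chain of functions applied to one record. This is only an illustration of the sequencing: the stage names are taken from the figure labels, while `make_stage`, `run_pipeline`, and the record fields are hypothetical, and each placeholder stage merely logs that it ran rather than doing any real processing.

```python
from functools import reduce


def make_stage(name):
    """Build a placeholder stage that only records that it ran."""
    def stage(record):
        # Return a new record so earlier snapshots stay unchanged.
        return {**record, "completed": record.get("completed", []) + [name]}
    return stage


STAGE_NAMES = (
    "multimodal_signal_extraction",
    "cross_modal_alignment_and_fusion",
    "intent_driven_question_design",
    "rubric_and_evaluation_framework",
    "human_validation_and_correction",
)
STAGES = [make_stage(name) for name in STAGE_NAMES]


def run_pipeline(raw_video):
    """Apply every stage in order, feeding each stage's output to the next."""
    return reduce(lambda record, stage: stage(record), STAGES, raw_video)


# Assumed input record for the 1h 20 min cooking tutorial shown in the figure.
item = run_pipeline({"video": "cooking_tutorial", "duration_min": 80})
```

Keeping a `completed` log on the record is one simple way to make each benchmark item traceable back through the stages that produced it.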
Comparison
The most comprehensive video benchmark
Comparing LongShOTBench against 14 existing benchmarks across 8 capability dimensions.
Our Benchmark
LongShOTBench
The only benchmark combining all three modalities with intent-driven Q&A, agentic tool usage, and custom rubrics for interpretable evaluation.
* Subtitle aided
What makes us unique
Intent-driven Q&A
The same video generates entirely different questions depending on who is watching and why. Every answer is evaluated against a custom rubric — rewarding evidence, penalizing hallucination.
E-Bike Review
Battery, motor, terrain, connectivity
Commuter
Practical daily insights
Does the e-bike's battery last long enough for my daily 20 km commute without charging?
Yes, the reviewer found the battery retained 80% after 25 km, easily covering your 20 km commute without mid-day charging. Some expect full drain within 10 km based on unofficial 16 km estimates. The real-world test proves otherwise.
Same video. Different intents. Different questions.
Each evaluated with a custom rubric that rewards correct evidence and penalizes hallucinated claims — making LongShOTBench diagnostic, not just a score.
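As a rough illustration of rubric-based scoring, the sketch below gives positive weight to criteria that cite correct evidence and negative weight to a hallucinated claim. The `Criterion` type, the weights, the keyword matching, and the example rubric are all assumptions made for this example; the page does not describe how LongShOTBench rubrics are encoded or judged, and a real evaluator would likely use a model or human judge instead of substring checks.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    description: str
    weight: float    # positive: rewarded evidence; negative: hallucination penalty
    keywords: tuple  # hypothetical proxy for a judge's decision


def criterion_met(criterion, answer):
    """Treat a criterion as met when every keyword appears in the answer."""
    text = answer.lower()
    return all(keyword in text for keyword in criterion.keywords)


def score_answer(answer, rubric):
    """Return a score in [0, 1] plus a per-criterion breakdown for traceability."""
    breakdown = {c.description: criterion_met(c, answer) for c in rubric}
    earned = sum(c.weight for c in rubric if breakdown[c.description])
    max_points = sum(c.weight for c in rubric if c.weight > 0)
    if max_points == 0:
        return 0.0, breakdown
    return max(0.0, earned) / max_points, breakdown


# Hypothetical rubric for the commuter question about the e-bike review.
rubric = [
    Criterion("Cites 80% battery remaining after 25 km", 2.0, ("80%", "25 km")),
    Criterion("Concludes the 20 km commute is covered", 1.0, ("20 km",)),
    Criterion("Invents a range figure not in the video", -2.0, ("100 km",)),
]

good, good_detail = score_answer(
    "Yes. The battery retained 80% after 25 km, so a 20 km commute is covered.",
    rubric,
)
bad, bad_detail = score_answer(
    "Yes, the battery lasts 100 km, so a 20 km commute is fine.",
    rubric,
)
```

Returning the breakdown alongside the score shows which criteria an answer satisfied or violated, which is what makes the result diagnostic and not only a single number.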