CVPR 2026 Findings

LongShOT: A Benchmark & Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos

Mohammed Irfan Kurpath*, Jaseel Muhammad Kaithakkodan*, Jinxing Zhou, Sahal Shaji Mullappilly, Mohammad Almansoori, Noor Ahsan, Beknur Kalmakhanbet, Sambal Shikhar, Rishabh Lalla, Jean Lahoud, Mariette Awad², Fahad Shahbaz Khan³, Salman Khan, Rao Muhammad Anwer, Hisham Cholakkal

* Equal Contribution

¹MBZUAI, ²American University of Beirut, ³Linkoping University

Abstract

Bridging the gap in long-form video understanding

Q&A Samples: Open-ended, intent-driven
Task Categories: Perception to reasoning
Avg. Duration: Long-form video
Human Verified: Manually reviewed

Long-form multimodal video understanding requires integrating vision, speech, and ambient audio with coherent long-range reasoning. Existing benchmarks emphasize either temporal length or multimodal richness, but rarely both. We introduce LongShOTBench, a diagnostic benchmark with open-ended, intent-driven questions; single- and multi-turn dialogues; and tasks requiring multimodal reasoning and agentic tool use across video, audio, and speech. Each item includes a reference answer and graded rubric for interpretable and traceable evaluation.

Furthermore, we present LongShOTAgent, an agentic system that analyzes long videos via preprocessing, search, and iterative refinement. On LongShOTBench, state-of-the-art MLLMs show large gaps: Gemini-2.5-Flash achieves 52.95%, open-source models remain below 30%, and LongShOTAgent attains 44.66%.
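The "preprocessing, search, and iterative refinement" pattern can be illustrated with a toy retrieve-and-refine loop over transcript segments. This is only a sketch under strong assumptions, not LongShOTAgent's implementation: the function names, the lexical-overlap search, and the query-expansion rule are all hypothetical stand-ins for whatever tools the agent actually calls.

```python
def search(segments, query_terms, top_k=2):
    """Rank segments by how many query terms they contain (toy lexical search)."""
    scored = []
    for index, text in enumerate(segments):
        overlap = len(set(text.lower().split()) & query_terms)
        if overlap:
            scored.append((overlap, index))
    # Highest overlap first; ties broken by earlier segment.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in scored[:top_k]]


def answer_with_refinement(segments, question, max_rounds=3):
    """Collect evidence segments, expanding the query after each round.

    Assumption: "refinement" here means adding the words of newly found
    segments to the query and searching again until nothing new turns up.
    """
    query_terms = set(question.lower().split())
    evidence = []
    for _ in range(max_rounds):
        new_hits = [i for i in search(segments, query_terms) if i not in evidence]
        if not new_hits:
            break  # no new evidence, stop refining
        evidence.extend(new_hits)
        for i in new_hits:
            query_terms |= set(segments[i].lower().split())
    return sorted(evidence)
```

A real system would replace the lexical search with multimodal retrieval and pass the collected evidence to a model for the final answer; the loop structure is the only point of the sketch.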

Pipeline

How LongShOTBench is built

From raw long-form video to a diagnostic, human-verified benchmark — a five-stage pipeline with multimodal extraction, intent-driven Q&A generation, and graded evaluation rubrics.

Raw Video Content: Cooking Tutorial, 1 h 20 min (video frames from the tutorial)

01 Multimodal Signal Extraction
   Visual: Frames & scenes
   Audio: Ambient sounds
   Speech: Transcripts

02 Cross-Modal Alignment & Fusion
   Segment-wise alignment, Temporal sync, Distilled metadata

03 Intent-Driven Question Design
   Scenario mapping, Question types, Multi-turn dialogues, Conversational answers

04 Rubric & Evaluation Framework
   Correctness criteria, Difficulty grading, Traceable scoring

05 Human Validation & Correction
   Manual review, Error correction, Quality assurance

Figure 1: The pipeline starts from raw video, extracts multimodal signals, aligns and fuses them across modalities, generates intent-driven Q&A, creates verifiable rubrics, and ends with human validation and correction.
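As a rough illustration of the segment-wise alignment stage, the sketch below buckets timestamped visual, audio, and speech events into fixed-length segments. The 60-second window, the event tuple layout, and the function name are assumptions made for this example; the page does not specify how LongShOTBench performs alignment.

```python
MODALITIES = ("visual", "audio", "speech")


def align_by_segment(events, segment_seconds=60.0):
    """Group (timestamp_seconds, modality, text) events into time segments.

    Assumption: fixed-length windows; a real pipeline might segment by
    scene boundaries or speech pauses instead.
    """
    segments = {}
    for timestamp, modality, text in events:
        if modality not in MODALITIES:
            raise ValueError(f"unknown modality: {modality}")
        index = int(timestamp // segment_seconds)
        # Every segment carries all three modalities, even if some are empty.
        segment = segments.setdefault(index, {m: [] for m in MODALITIES})
        segment[modality].append(text)
    return dict(sorted(segments.items()))
```

Keeping an empty list for a missing modality makes it explicit when a segment has, for example, speech but no notable ambient audio.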

Comparison

The most comprehensive video benchmark

Comparing LongShOTBench against 14 existing benchmarks across 8 capability dimensions.

Our Benchmark

LongShOTBench

The only benchmark combining all three modalities with intent-driven Q&A, agentic tool usage, and custom rubrics for interpretable evaluation.

LongShOTBench: 8/8
DailyOmni: 3/8
TriSense-2M: 5/8
LongVALE: 4/8
Video-MME: 3/8
InfiniBench: 3/8 *
Video-Holmes: 2/8
LvBench: 3/8 *
LVBench: 1/8
SVBench: 3/8
MLVU: 2/8
Moviechat: 3/8
LongVideoBench: 1/8
EgoSchema: 1/8
MV-Bench: 1/8

* Subtitle aided

What makes us unique

Intent-driven Q&A

The same video generates entirely different questions depending on who is watching and why. Every answer is evaluated against a custom rubric — rewarding evidence, penalizing hallucination.

E-Bike Review

Battery, motor, terrain, connectivity

Commuter

Practical daily insights

How this user thinks about e-bikes
Practical insights for real-world use
Battery longevity, speed, comfort
Trade-offs between features
Hidden insights beyond specs
Generated Questions
Question

Does the e-bike's battery last long enough for my daily 20 km commute without charging?

Reference Answer

Yes, the reviewer found the battery retained 80% after 25 km, easily covering your 20 km commute without mid-day charging. Some expect full drain within 10 km based on unofficial 16 km estimates. The real-world test proves otherwise.

Evaluation Rubric (Total: 30 pts possible)
+10  Must mention battery retained over 80% after 25 km
+10  Must mention user's 20 km commute scenario
+10  Must conclude no mid-day charge needed
-5   Claiming full capacity loss within 10 km
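The rubric above can be read as an additive scoring scheme: each satisfied criterion adds its points and each triggered penalty subtracts. The sketch below encodes the e-bike rubric that way. How LongShOTBench aggregates and normalizes points is not stated here, so flooring at zero and dividing by the 30 possible points are assumptions, as are the names `Criterion` and `score_answer`; deciding which criteria an answer satisfies is left to a judge outside this snippet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    description: str
    points: int  # positive = reward, negative = penalty


# Rubric transcribed from the e-bike example.
EBIKE_RUBRIC = [
    Criterion("Mentions battery retained over 80% after 25 km", 10),
    Criterion("Mentions user's 20 km commute scenario", 10),
    Criterion("Concludes no mid-day charge needed", 10),
    Criterion("Claims full capacity loss within 10 km", -5),
]


def score_answer(rubric, satisfied):
    """Return (raw_points, normalized_score) for the satisfied criterion indices.

    Assumption: raw points are floored at 0 and normalized by the sum of
    positive points; the benchmark's actual aggregation may differ.
    """
    max_points = sum(c.points for c in rubric if c.points > 0)
    raw = max(sum(rubric[i].points for i in satisfied), 0)
    return raw, raw / max_points
```

For example, `score_answer(EBIKE_RUBRIC, {0, 1, 2})` returns `(30, 1.0)`, while an answer that cites the 25 km test but also asserts the hallucinated 10 km drain, `{0, 3}`, drops to 5 raw points.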

Same video. Different intents. Different questions.

Each evaluated with a custom rubric that rewards correct evidence and penalizes hallucinated claims — making LongShOTBench diagnostic, not just a score.