LongShOT: A Benchmark for Omni-Modal Reasoning in Long Videos

Mohammed Irfan Kurpath*, Jaseel Muhammad Kaithakkodan*, Jinxing Zhou, Sahal Shaji Mullappilly, Mohammad Almansoori, Noor Ahsan, Beknur Kalmakhanbet, Sambal Shikhar, Rishabh Lalla, Jean Lahoud, Mariette Awad², Fahad Shahbaz Khan³, Salman Khan, Rao Muhammad Anwer, Hisham Cholakkal

* Equal Contribution

¹MBZUAI, ²American University of Beirut, ³Linköping University

Abstract

Bridging the gap in long-form video understanding

Q&A Pairs: 4,893 (open-ended, intent-driven)
Models Benchmarked: 105 (6 paradigms evaluated)
Avg. Duration: long-form video
Annotation Hours: human validation effort
Human Verified: manually reviewed

Long-form omni-modal video understanding requires models to integrate vision, speech, and ambient audio with coherent long-context reasoning. Existing video benchmarks often trade off temporal scale, modality coverage, open-ended interaction, and interpretable scoring. We introduce LongShOTBench, a long video understanding benchmark designed around three coupled goals: holistic omni-modal integration, intent-driven open-ended interaction, and rubric-level diagnosis. Each item includes a reference answer and a weighted criterion-level rubric, enabling evaluation to identify which perceptual and reasoning steps are satisfied or missed.

We also introduce LongShOTAgent, a training-free omni-modal evidence-seeking agent that couples full-video preprocessing with targeted retrieval, query-adaptive segment refinement, and explicit claim verification. We perform a comprehensive evaluation of 105 video-capable models spanning open-source omni-modal models, vision-language systems, audio LLMs, agentic pipelines, and closed-source APIs. The strongest closed-source API, Gemini 3.1 Pro, reaches 55.63% and the best open-source model, Qwen3-Omni 30B-T, reaches 64.05%, while our LongShOTAgent emerges as the strongest system at 66.64%.
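To make the idea of a weighted criterion-level rubric concrete, here is a minimal sketch of how such a score could be computed. The exact scoring formula is not stated on this page, so the normalized weighted sum below is an assumption, and the names `Criterion`, `rubric_score`, and `missed_criteria`, along with the example item, are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Criterion:
    """One rubric line: what the answer must contain and how much it counts."""
    description: str
    weight: float


def rubric_score(criteria: Sequence[Criterion], satisfied: Sequence[bool]) -> float:
    """Weighted share of satisfied criteria, as a percentage (assumed formula)."""
    if len(criteria) != len(satisfied):
        raise ValueError("need exactly one judgement per criterion")
    total = sum(c.weight for c in criteria)
    if total <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    earned = sum(c.weight for c, ok in zip(criteria, satisfied) if ok)
    return 100.0 * earned / total


def missed_criteria(criteria: Sequence[Criterion], satisfied: Sequence[bool]) -> List[str]:
    """Descriptions of the criteria the answer failed, for diagnosis."""
    return [c.description for c, ok in zip(criteria, satisfied) if not ok]


# Hypothetical item, not taken from the dataset.
rubric = [
    Criterion("Names the ingredient added to the pan", 3.0),
    Criterion("Mentions the sizzling sound that follows", 2.0),
    Criterion("Places the event at the correct point in the video", 1.0),
]
judgements = [True, False, True]

score = rubric_score(rubric, judgements)      # 4 of 6 weight units earned
missed = missed_criteria(rubric, judgements)  # the audio criterion was missed
```

Keeping the per-criterion judgements alongside the aggregate is what allows a score to be traced back to the specific perceptual or reasoning step that was missed.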

Demo

See LongShOT in Action

A walkthrough of LongShOTAgent, from question to grounded answer.

Findings

What the benchmark reveals

Evaluating 105 video-capable models surfaces consistent, sometimes counterintuitive patterns about where long-form omni-modal reasoning breaks down today.

Architecture > scale

Small omni models beat much larger ones

Overall score (%) and parameter count:
Qwen3-Omni 30B-T: 64.0, 30B params
Qwen3-VL 235B-T: 47.2, 235B params
Kimi-K2.6: 45.2, 1.1T params
Gemma-3 27B: 41.5, 27B params
InternVL3.5 241B: 29.8, 241B params

A 30B omni model tops a 235B video model, and a 27B model beats a 241B one. The bottleneck is cross-modal alignment, not raw parameter count.

Reasoning > perception

Models reason better than they perceive

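Reasoning vs. core perception (%), with the gap in points:
Gemini 3.1 Pro: reasoning 65.8, core perception 43.0 (+22.8)
Qwen3-Omni 30B-T: reasoning 72.9, core perception 51.4 (+21.4)
MiMo-VL 7B: reasoning 42.5, core perception 29.9 (+12.6)
Kimi-VL-T: reasoning 27.5, core perception 17.8 (+9.7)
ERNIE-4.5-VL 28B: reasoning 30.6, core perception 23.1 (+7.5)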

Across methods, reasoning scores sit well above core perception. Grounding specific facts at specific moments is the harder axis, by up to 23 points.

The audio bottleneck

Non-speech audio is the weakest modality

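Speech vs. non-speech audio (%), with the gap in points:
Qwen3-Omni 30B-T: speech 63.7, non-speech audio 55.9 (−7.9)
Gemini 3.1 Pro: speech 53.6, non-speech audio 46.1 (−7.5)
Nemotron-3 Omni 30B: speech 40.0, non-speech audio 32.9 (−7.1)
HumanSense Omni-R: 18.0 (−2.8)
SALMONN2+ 7B: 13.4 (−2.9)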

Within the audio channel, transcribable speech is handled well but ambient non-speech audio lags behind. Grounding door clicks and music shifts to the timeline remains an open problem.

Test-time compute

Explicit thinking helps a lot

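Thinking vs. instruct variants, overall score (%), with the gap in points:
Qwen3-Omni 30B: thinking 64.0, instruct 39.2 (+24.9)
Qwen3-VL 30B-A3B: thinking 43.4, instruct 30.7 (+12.7)
Kimi-VL: thinking 24.3, instruct 15.8 (+8.6)
Qwen3-VL 8B: thinking 42.0, instruct 34.5 (+7.5)
Qwen3-VL 235B: thinking 47.2, instruct 42.2 (+5.0)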

Within a family, thinking variants outscore their instruct siblings by 5 to 25 points, with the largest gains on the strongest omni models.

The Agent

Why the agent wins

LongShOTAgent is training-free. Its search-refine-verify loop lifts a base model to the top of the benchmark and outperforms prior vision-centric agents that overlook non-speech audio.
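A search-refine-verify loop of this kind can be sketched in a few lines. This is an illustrative skeleton under stated assumptions, not LongShOTAgent's actual implementation: the function `search_refine_verify`, its four callables, the `max_rounds` budget, and the toy stand-ins below are all hypothetical, and in a real system the callables would wrap retrieval tools and model calls.

```python
from typing import Callable, List, Optional, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec) of a candidate evidence window


def search_refine_verify(
    question: str,
    search: Callable[[str], List[Segment]],
    refine: Callable[[str, List[Segment]], List[Segment]],
    answer: Callable[[str, List[Segment]], str],
    verify: Callable[[str, List[Segment]], bool],
    max_rounds: int = 3,
) -> Tuple[Optional[str], bool]:
    """Draft an answer from retrieved segments; refine the segments until it verifies."""
    segments = search(question)
    draft: Optional[str] = None
    for _ in range(max_rounds):
        draft = answer(question, segments)
        if verify(draft, segments):
            return draft, True  # claim supported by the current evidence
        segments = refine(question, segments)  # narrow the evidence and retry
    return draft, False  # budget exhausted; return the last unverified draft


# Toy stand-ins (assumptions) so the control flow can be run end to end.
def toy_search(question: str) -> List[Segment]:
    return [(0.0, 600.0)]


def toy_refine(question: str, segments: List[Segment]) -> List[Segment]:
    # Keep the first half of every window.
    return [(start, start + (end - start) / 2) for start, end in segments]


def toy_answer(question: str, segments: List[Segment]) -> str:
    return f"evidence window {segments[0][0]}-{segments[0][1]}s"


def toy_verify(draft: str, segments: List[Segment]) -> bool:
    # Pretend a claim is verifiable once every window is at most 150 s long.
    return all(end - start <= 150.0 for start, end in segments)


result = search_refine_verify(
    "When does the music change?", toy_search, toy_refine, toy_answer, toy_verify
)
```

Passing the tools in as callables keeps the orchestration logic separate from any particular retriever or base model, which matches the page's point that the loop wraps a base model without training.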

The agentic loop lifts the baseline model

Wrapping a base LLM in the search-refine-verify loop adds 12.95 to 38.52 points. The orchestrator, not just tool access, drives the gain.

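Base model score vs. score inside the agent loop (%), with the gain in points:
Qwen3.6-35B-A3B: base 27.17, with agent 65.69 (+38.52)
Qwen3-VL-30B-A3B: base 25.82, with agent 52.54 (+26.72)
Gemma-4-31B-IT: base 34.22, with agent 47.17 (+12.95)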
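The reported gains follow directly from the paired scores. The short check below recomputes them from the numbers on this page; the variable names are illustrative.

```python
# (base score, score inside the agent loop), in percent, as listed above.
results = {
    "Qwen3.6-35B-A3B": (27.17, 65.69),
    "Qwen3-VL-30B-A3B": (25.82, 52.54),
    "Gemma-4-31B-IT": (34.22, 47.17),
}

# Gain in points = agent score minus base score, rounded to two decimals.
gains = {name: round(agent - base, 2) for name, (base, agent) in results.items()}

smallest_gain = min(gains.values())
largest_gain = max(gains.values())
```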

Vision-centric agents don't transfer

Prior long-video agents all score below 15% on LongShOTBench. Treating non-speech audio as a first-class signal is what closes the gap.

LongShOTAgent: 66.64
VideoMind: 10.83
Video-RAG: 6.66
Vgent: 5.83

Overall score (%), 3-verifier mean. LongShOTAgent leads the next agent by over 6x.

Robust where others game the test

The advantage is comprehension, not multiple-choice shortcuts

On standard MCQ benchmarks like Video-MME and WorldSense, strong monolithic models sit a few points ahead. But strip the answer options and force open-ended responses, and they collapse while LongShOTAgent barely moves, because it generates answers from retrieved evidence rather than picking from a list.

Video-MME, MCQ → open-ended:
LongShOTAgent: 69.9 (-1.3 pp)
Qwen3-Omni-T: MCQ 69.7, open-ended 56.1 (-13.6 pp)
Qwen3-Omni-I: MCQ 70.5, open-ended 51.8 (-18.7 pp)
Qwen3.5-35B-A3B: MCQ 78.9, open-ended 58.4 (-20.5 pp)

WorldSense, MCQ → open-ended:
LongShOTAgent: 43.4 (-3.2 pp)
Qwen3-Omni-T: MCQ 52.6, open-ended 33.6 (-19.1 pp)
Qwen3-Omni-I: MCQ 54.0, open-ended 41.7 (-12.3 pp)

Accuracy (%) when moving from standard 4-option MCQ to options-stripped open-ended answering, with the change in percentage points (pp). On both benchmarks LongShOTAgent has the smallest drop and leads open-ended.
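The MCQ-to-open-ended change is a plain difference in percentage points. As a small worked example, the snippet below recomputes it for the Video-MME rows that report both scores; the dictionary and variable names are illustrative, and because the inputs are already rounded to one decimal, a recomputed value can differ from a published one by 0.1 in other rows.

```python
# (MCQ accuracy, open-ended accuracy) on Video-MME, in percent.
video_mme = {
    "Qwen3-Omni-T": (69.7, 56.1),
    "Qwen3-Omni-I": (70.5, 51.8),
    "Qwen3.5-35B-A3B": (78.9, 58.4),
}

# Change in percentage points when the answer options are removed.
drop_pp = {
    name: round(open_ended - mcq, 1) for name, (mcq, open_ended) in video_mme.items()
}

# The model that loses the most accuracy without answer options.
most_affected = min(drop_pp, key=drop_pp.get)
```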

Comparison

A comprehensive video benchmark

Comparing LongShOTBench against 18 existing benchmarks across 7 capability dimensions.

Our Benchmark

LongShOTBench

The only benchmark in our comparison to combine all three modalities with intent-driven Q&A, multi-turn dialogue, and custom rubrics for interpretable evaluation.

Capability dimensions: Visual, Audio, Speech, Open-Ended, Multi-Turn, Intent-Driven, Rubrics.

Dimensions covered (out of 7) by each benchmark:
LongShOTBench: 7/7
VideoOdyssey: 3/7
LVOmniBench: 3/7
WorldSense: 3/7
OmniVideoBench: 3/7
Daily-Omni: 3/7
TriSense-2M: 4/7
LongVALE: 4/7
Video-MME: 3/7
InfiniBench: 3/7
Video-Holmes: 3/7
LvBench (MoVQA): 2/7
LVBench: 1/7
SVBench: 3/7
MLVU: 2/7
Moviechat: 2/7
LongVideoBench: 2/7
EgoSchema: 1/7
MV-Bench: 1/7

* Subtitle aided

Pipeline

How LongShOTBench is built

From raw long-form video to a diagnostic, human-verified benchmark: a five-stage pipeline with multimodal extraction, intent-driven Q&A generation, and graded evaluation rubrics.

Input: Raw Video Content (example: Cooking Tutorial, 1h 20 min)

Stage 1. Multimodal Signal Extraction: Visual (frames & scenes), Audio (ambient sounds), Speech (transcripts)

Stage 2. Cross-Modal Alignment & Fusion: segment-wise alignment, temporal sync, distilled metadata

Stage 3. Intent-Driven Question Design: scenario mapping, question types, multi-turn dialogues, conversational answers

Stage 4. Rubric & Evaluation Framework: correctness criteria, difficulty grading, traceable scoring

Stage 5. Human Validation & Correction: manual review, error correction, quality assurance

Output: LongShOTBench (4,893 Q&A pairs, graded rubrics, human verified)

Figure 1: The pipeline starts from raw video data, extracts multimodal signals, generates intent-driven Q&A, creates verifiable rubrics, and ends with human validation.
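As an illustration of what segment-wise alignment with temporal sync can look like, the sketch below buckets timestamped visual, audio, and speech events into fixed-length segments by time overlap. This is an assumption about the general technique rather than the benchmark's actual pipeline code: the `Event` and `Segment` types, the `align_events` function, the 60-second segment length, and the example events are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Event:
    modality: str  # "visual", "audio", or "speech"
    start: float   # seconds from the beginning of the video
    end: float
    text: str      # caption, sound label, or transcript snippet


@dataclass
class Segment:
    start: float
    end: float
    events: List[Event] = field(default_factory=list)


def align_events(events: List[Event], video_duration: float,
                 segment_len: float = 60.0) -> List[Segment]:
    """Split the timeline into fixed windows and attach every overlapping event."""
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    segments: List[Segment] = []
    t = 0.0
    while t < video_duration:
        segments.append(Segment(t, min(t + segment_len, video_duration)))
        t += segment_len
    for ev in events:
        for seg in segments:
            # Half-open overlap test: an event spanning a boundary lands in both windows.
            if ev.start < seg.end and ev.end > seg.start:
                seg.events.append(ev)
    return segments


# Hypothetical events from a short cooking clip.
events = [
    Event("speech", 10.0, 20.0, "Add the onions"),
    Event("audio", 55.0, 65.0, "sizzling"),
    Event("visual", 130.0, 140.0, "plating the dish"),
]
segments = align_events(events, video_duration=150.0)
modalities = [[ev.modality for ev in seg.events] for seg in segments]
```

Once events from all three modalities share a segment, downstream question design can draw on co-occurring evidence, for example a spoken instruction followed by the sound it causes.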

Citation

Cite LongShOT

If you find LongShOT useful in your research, please cite our paper.

@misc{kurpath2026benchmarkomnimodalreasoninglong,
  title={A Benchmark for Omni-Modal Reasoning in Long Videos},
  author={Mohammed Irfan Kurpath and Jaseel Muhammad Kaithakkodan and Jinxing Zhou and Sahal Shaji Mullappilly and Mohammad Almansoori and Noor Ahsan and Beknur Kalmakhanbet and Sambal Shikhar and Rishabh Lalla and Jean Lahoud and Mariette Awad and Fahad Shahbaz Khan and Salman Khan and Rao Muhammad Anwer and Hisham Cholakkal},
  year={2026},
  eprint={2512.16978},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.16978}
}