arXiv 2506.05425

SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning

Fanqi Kong1,2, Weiqin Zu2,4, Xinyu Chen1, Yaodong Yang1, Song-Chun Zhu1,2,3, Xue Feng2 ✉

1Peking University  ·  2State Key Lab of General AI, BIGAI  ·  3Tsinghua University  ·  4ShanghaiTech University

SIV-Bench evaluates how well multimodal language models understand real-world social interaction. Built on 2,792 curated video clips and 5,455 human-verified question–answer pairs, the benchmark is grounded in Fiske's relational models theory and spans three dimensions: perceiving what is visibly happening, inferring latent social and mental states, and predicting how interactions will unfold.

Examples of SIV-Bench videos and tasks

SSU · Social Scene Understanding

Ground visible cues: action, expression, environment, attributes.

SSR · Social State Reasoning

Infer latent signals: emotion, intent, attitude, relation.

SDP · Social Dynamics Prediction

Predict how interactions unfold, factually and counterfactually.
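The three dimensions and their focus areas can be written down as a small lookup structure. This is only an illustrative sketch: the names `Dimension`, `FOCUS`, and `dimension_of` are hypothetical and are not taken from the released dataset or code; the focus areas are copied from the descriptions above.

```python
from enum import Enum


class Dimension(Enum):
    """The three SIV-Bench dimensions (labels as used on this page)."""
    SSU = "Social Scene Understanding"
    SSR = "Social State Reasoning"
    SDP = "Social Dynamics Prediction"


# Focus areas per dimension, copied from the short descriptions above.
# The dictionary layout itself is an assumption made for illustration.
FOCUS = {
    Dimension.SSU: ("action", "expression", "environment", "attributes"),
    Dimension.SSR: ("emotion", "intent", "attitude", "relation"),
    Dimension.SDP: ("factual", "counterfactual"),
}


def dimension_of(focus: str) -> Dimension:
    """Return the dimension that a focus area belongs to."""
    for dimension, areas in FOCUS.items():
        if focus in areas:
            return dimension
    raise KeyError(f"unknown focus area: {focus}")


if __name__ == "__main__":
    print(dimension_of("intent").value)
```

A mapping like this makes it easy to group per-question results by dimension when reproducing the breakdown reported in the results.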

Construction Pipeline

Source clips are mined from TikTok and YouTube and paired with LLM-generated QA, which is then refined through adversarial filtering and human verification. Each video is evaluated under three subtitle conditions (origin, +subtitle, and −subtitle) to isolate visual, auditory, and textual grounding.

SIV-Bench data construction pipeline
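Scoring each subtitle condition separately might look like the sketch below. It assumes predictions are available as `(condition, predicted, gold)` tuples and labels the conditions `origin`, `+sub`, and `-sub` after the results table; the function name, the record layout, and the toy answers are all hypothetical and are not the benchmark's actual evaluation code or data.

```python
from collections import defaultdict

# Condition labels follow the results table; the ASCII "-sub" is used here.
CONDITIONS = ("origin", "+sub", "-sub")


def accuracy_by_condition(records):
    """Compute accuracy (%) per subtitle condition.

    records: iterable of (condition, predicted_choice, gold_choice) tuples.
    Returns a dict mapping each condition that has records to its accuracy.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for condition, predicted, gold in records:
        if condition not in CONDITIONS:
            raise ValueError(f"unknown condition: {condition}")
        total[condition] += 1
        correct[condition] += int(predicted == gold)
    return {c: 100.0 * correct[c] / total[c] for c in CONDITIONS if total[c]}


if __name__ == "__main__":
    # Made-up toy answers, for illustration only.
    toy = [
        ("origin", "A", "A"), ("origin", "B", "C"),
        ("+sub", "A", "A"), ("+sub", "C", "C"),
        ("-sub", "B", "A"), ("-sub", "C", "C"),
    ]
    print(accuracy_by_condition(toy))
```

Keeping the condition as an explicit field on every record lets the same questions be compared across settings without re-running the scoring logic.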

Main Results

Accuracy (%) on SSU, SSR, SDP, and the overall benchmark under the origin / +subtitle / −subtitle settings. Models perceive scenes well (SSU) but show a consistent gap on social reasoning (SSR, SDP).

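| Model | Params | SSU orig | SSU +sub | SSU −sub | SSR orig | SSR +sub | SSR −sub | SDP orig | SDP +sub | SDP −sub | Overall orig | Overall +sub | Overall −sub |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Open-source MLLMs | | | | | | | | | | | | | |
| mPLUG-Owl3 | 7B | 46.11 | 45.94 | 46.34 | 39.78 | 39.50 | 38.13 | 44.30 | 46.08 | 44.35 | 42.06 | 42.42 | 41.15 |
| LLaVA-OneVision | 7B | 39.20 | 39.41 | 40.08 | 41.95 | 43.65 | 38.66 | 43.79 | 44.51 | 39.42 | 41.97 | 43.04 | 39.42 |
| LLaVA-Video | 7B | 50.22 | 50.61 | 50.56 | 39.33 | 38.19 | 36.14 | 41.60 | 42.14 | 39.94 | 41.09 | 42.00 | 38.66 |
| Qwen2.5-VL-7B-Instruct | 7B | 51.22 | 50.88 | 50.22 | 40.24 | 38.94 | 37.66 | 42.69 | 43.82 | 42.24 | 44.02 | 44.21 | 41.65 |
| InternVL3-8B | 8B | 56.83 | 56.13 | 56.50 | 40.35 | 40.90 | 37.92 | 44.53 | 45.52 | 44.40 | 45.82 | 46.05 | 44.56 |
| Qwen2.5-VL-72B-Instruct | 72B | 75.73 | 76.24 | 73.54 | 52.25 | 52.75 | 51.21 | 59.02 | 58.40 | 57.78 | 58.80 | 59.63 | 57.66 |
| InternVL3-78B | 78B | 71.46 | 73.66 | 71.76 | 51.65 | 52.39 | 50.14 | 55.77 | 56.28 | 54.25 | 55.46 | 56.32 | 54.50 |
| Closed-source MLLMs | | | | | | | | | | | | | |
| o4-mini | | 78.83 | 79.04 | 78.13 | 50.47 | 51.30 | 48.99 | 56.89 | 56.00 | 55.26 | 55.68 | 55.89 | 54.54 |
| GPT-4o | | 79.10 | 79.74 | 78.06 | 52.73 | 53.20 | 51.79 | 59.02 | 60.59 | 58.60 | 58.02 | 58.86 | 56.99 |
| Gemini-2.0-Flash | | 78.46 | 78.16 | 78.34 | 51.89 | 52.43 | 49.78 | 57.59 | 58.63 | 55.70 | 56.40 | 57.23 | 54.64 |
| Gemini-2.5-Flash | | 81.70 | 82.14 | 79.71 | 48.99 | 50.54 | 47.60 | 59.47 | 59.95 | 56.88 | 57.87 | 58.11 | 56.05 |
| Gemini-2.5-Pro | | 85.07 | 85.41 | 84.94 | 54.30 | 54.85 | 52.32 | 60.45 | 61.54 | 58.83 | 61.65 | 62.40 | 60.22 |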

Best overall: Gemini-2.5-Pro at 62.40% (+sub). Best open-source: Qwen2.5-VL-72B at 59.63% (+sub). Removing subtitles costs more on SSR (−2.07) and SDP (−1.68) than on SSU (−0.97), suggesting models lean on text more for higher-level reasoning.
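The headline figures can be re-derived from the Overall columns of the results table. The sketch below copies those values and looks up the best model per setting, plus each model's change from +sub to −sub. The helper names are hypothetical, and because the aggregation behind the per-dimension averages quoted above is not stated here, the sketch reports per-model differences only.

```python
# Overall accuracy (%) as (orig, +sub, -sub), copied from the results table.
OVERALL = {
    "mPLUG-Owl3": (42.06, 42.42, 41.15),
    "LLaVA-OneVision": (41.97, 43.04, 39.42),
    "LLaVA-Video": (41.09, 42.00, 38.66),
    "Qwen2.5-VL-7B-Instruct": (44.02, 44.21, 41.65),
    "InternVL3-8B": (45.82, 46.05, 44.56),
    "Qwen2.5-VL-72B-Instruct": (58.80, 59.63, 57.66),
    "InternVL3-78B": (55.46, 56.32, 54.50),
    "o4-mini": (55.68, 55.89, 54.54),
    "GPT-4o": (58.02, 58.86, 56.99),
    "Gemini-2.0-Flash": (56.40, 57.23, 54.64),
    "Gemini-2.5-Flash": (57.87, 58.11, 56.05),
    "Gemini-2.5-Pro": (61.65, 62.40, 60.22),
}

# Models listed under "Open-source MLLMs" in the table.
OPEN_SOURCE = (
    "mPLUG-Owl3", "LLaVA-OneVision", "LLaVA-Video", "Qwen2.5-VL-7B-Instruct",
    "InternVL3-8B", "Qwen2.5-VL-72B-Instruct", "InternVL3-78B",
)

# Column indices into the tuples above.
ORIG, PLUS_SUB, MINUS_SUB = 0, 1, 2


def best_model(models, setting):
    """Return the model name with the highest overall accuracy in a setting."""
    return max(models, key=lambda name: OVERALL[name][setting])


def subtitle_drop(name):
    """Overall accuracy change (points) when going from +sub to -sub."""
    return round(OVERALL[name][MINUS_SUB] - OVERALL[name][PLUS_SUB], 2)


if __name__ == "__main__":
    print("best overall (+sub):", best_model(OVERALL, PLUS_SUB))
    print("best open-source (+sub):", best_model(OPEN_SOURCE, PLUS_SUB))
    for name in OVERALL:
        print(f"{name}: {subtitle_drop(name):+.2f}")
```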

SIV-Bench-Hard: Reasoning Quality vs. Humans

A focused subset isolates the most failure-prone questions and adds short rationales, so we can compare both answer accuracy and reasoning quality against a human baseline.

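| Model | Acc% | Relevance | Alignment | Coherence | Depth | Conciseness | Overall |
|---|---|---|---|---|---|---|---|
| Reference | | | | | | | |
| Human (n=3) | 74.40 | | | | | | |
| Models | | | | | | | |
| Gemini-3-Pro | 45.50 | 4.66 | 3.30 | 4.67 | 3.49 | 4.87 | 4.10 |
| GPT-5.1 | 39.00 | 4.58 | 3.29 | 4.65 | 3.26 | 4.88 | 4.00 |
| Gemini-2.5-Pro | 37.00 | 4.57 | 3.26 | 4.65 | 3.41 | 4.89 | 4.05 |
| Gemini-2.5-Flash | 32.32 | 4.48 | 3.17 | 4.55 | 3.22 | 4.87 | 3.95 |
| GPT-4o-mini | 29.00 | 4.45 | 3.20 | 4.56 | 3.12 | 4.91 | 3.90 |
| Qwen2.5-VL-7B | 24.50 | 4.00 | 2.89 | 4.21 | 3.05 | 4.45 | 3.63 |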

The best model still trails the human baseline by about 29 points in answer accuracy (45.50% vs. 74.40%). The gap to humans is largest on alignment (how well the rationale fits the social context), where model scores range from 2.89 to 3.30.
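As a quick check of the arithmetic, the accuracy gap to the human baseline and the spread of alignment scores can be computed from the SIV-Bench-Hard table. The values below are copied from that table; the variable and function names are hypothetical.

```python
# Answer accuracy (%) on SIV-Bench-Hard, copied from the table above.
HUMAN_ACC = 74.40
MODEL_ACC = {
    "Gemini-3-Pro": 45.50,
    "GPT-5.1": 39.00,
    "Gemini-2.5-Pro": 37.00,
    "Gemini-2.5-Flash": 32.32,
    "GPT-4o-mini": 29.00,
    "Qwen2.5-VL-7B": 24.50,
}

# Alignment scores from the same table.
ALIGNMENT = {
    "Gemini-3-Pro": 3.30,
    "GPT-5.1": 3.29,
    "Gemini-2.5-Pro": 3.26,
    "Gemini-2.5-Flash": 3.17,
    "GPT-4o-mini": 3.20,
    "Qwen2.5-VL-7B": 2.89,
}


def gap_to_human(model):
    """Accuracy points by which a model trails the human baseline."""
    return round(HUMAN_ACC - MODEL_ACC[model], 2)


if __name__ == "__main__":
    best = max(MODEL_ACC, key=MODEL_ACC.get)
    print(f"best model: {best}, gap to human: {gap_to_human(best):.2f} points")
    print(f"alignment range: {min(ALIGNMENT.values())} to {max(ALIGNMENT.values())}")
```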

Citation

@misc{kong2025sivbench,
  title  = {SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning},
  author = {Fanqi Kong and Weiqin Zu and Xinyu Chen and Yaodong Yang and Song-Chun Zhu and Xue Feng},
  year   = {2025},
  eprint = {2506.05425},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url    = {https://arxiv.org/abs/2506.05425}
}