SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning

SIV-Bench evaluates how well multimodal language models understand real-world social interaction. Grounded in Fiske's relational models theory, the benchmark comprises 2,792 curated video clips and 5,455 human-verified question–answer pairs. It spans three dimensions: perceiving what is visibly happening, inferring latent social and mental states, and predicting how interactions will unfold.

SSU · Social Scene Understanding

Ground visible cues: action, expression, environment, attributes.

SSR · Social State Reasoning

Infer latent signals: emotion, intent, attitude, relation.

SDP · Social Dynamics Prediction

Predict how interactions unfold, factually and counterfactually.

Construction Pipeline

Source clips are collected from TikTok and YouTube and paired with LLM-generated QA, which is then refined through adversarial filtering and human verification. Each video is evaluated under three subtitle conditions (original, subtitles added, and subtitles removed) to isolate visual, auditory, and textual grounding.
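Scoring under the three subtitle conditions amounts to grouping answer outcomes by dimension and condition. A minimal sketch, assuming a hypothetical record layout: the `dimension`, `condition`, `predicted`, and `answer` keys and the toy records below are illustrative and are not taken from the benchmark's released tooling.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return accuracy (%) keyed by (dimension, condition).

    Each record is a dict with hypothetical keys:
    'dimension' (e.g. "SSU"), 'condition' (e.g. "orig"),
    'predicted' and 'answer' (the chosen and the correct option).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        key = (rec["dimension"], rec["condition"])
        total[key] += 1
        if rec["predicted"] == rec["answer"]:
            correct[key] += 1
    return {key: 100.0 * correct[key] / total[key] for key in total}


# Toy records, invented for illustration only.
TOY_RECORDS = [
    {"dimension": "SSU", "condition": "orig", "predicted": "A", "answer": "A"},
    {"dimension": "SSU", "condition": "orig", "predicted": "B", "answer": "C"},
    {"dimension": "SSR", "condition": "+sub", "predicted": "D", "answer": "D"},
    {"dimension": "SSR", "condition": "-sub", "predicted": "A", "answer": "B"},
]

if __name__ == "__main__":
    for key, acc in sorted(accuracy_by_group(TOY_RECORDS).items()):
        print(key, f"{acc:.2f}")
```

Keeping the condition in the grouping key makes it straightforward to compare the same questions across subtitle settings.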

Main Results

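Accuracy (%) on SSU, SSR, SDP, and the overall benchmark under the original (orig), +subtitle (+sub), and −subtitle (−sub) settings. Models perceive scenes well but show a consistent gap on social reasoning: the best SSU score is 85.41%, while no model exceeds 54.85% on SSR.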

Model	Params	SSU orig	SSU +sub	SSU −sub	SSR orig	SSR +sub	SSR −sub	SDP orig	SDP +sub	SDP −sub	Overall orig	Overall +sub	Overall −sub
Open-source MLLMs
mPLUG-Owl3	7B	46.11	45.94	46.34	39.78	39.50	38.13	44.30	46.08	44.35	42.06	42.42	41.15
LLaVA-OneVision	7B	39.20	39.41	40.08	41.95	43.65	38.66	43.79	44.51	39.42	41.97	43.04	39.42
LLaVA-Video	7B	50.22	50.61	50.56	39.33	38.19	36.14	41.60	42.14	39.94	41.09	42.00	38.66
Qwen2.5-VL-7B-Instruct	7B	51.22	50.88	50.22	40.24	38.94	37.66	42.69	43.82	42.24	44.02	44.21	41.65
InternVL3-8B	8B	56.83	56.13	56.50	40.35	40.90	37.92	44.53	45.52	44.40	45.82	46.05	44.56
Qwen2.5-VL-72B-Instruct	72B	75.73	76.24	73.54	52.25	52.75	51.21	59.02	58.40	57.78	58.80	59.63	57.66
InternVL3-78B	78B	71.46	73.66	71.76	51.65	52.39	50.14	55.77	56.28	54.25	55.46	56.32	54.50
Closed-source MLLMs
o4-mini	—	78.83	79.04	78.13	50.47	51.30	48.99	56.89	56.00	55.26	55.68	55.89	54.54
GPT-4o	—	79.10	79.74	78.06	52.73	53.20	51.79	59.02	60.59	58.60	58.02	58.86	56.99
Gemini-2.0-Flash	—	78.46	78.16	78.34	51.89	52.43	49.78	57.59	58.63	55.70	56.40	57.23	54.64
Gemini-2.5-Flash	—	81.70	82.14	79.71	48.99	50.54	47.60	59.47	59.95	56.88	57.87	58.11	56.05
Gemini-2.5-Pro	—	85.07	85.41	84.94	54.30	54.85	52.32	60.45	61.54	58.83	61.65	62.40	60.22

Best overall: Gemini-2.5-Pro at 62.40% (+sub). Best open-source: Qwen2.5-VL-72B at 59.63% (+sub). Removing subtitles costs more on SSR (−2.07) and SDP (−1.68) than on SSU (−0.97), suggesting models lean on text more for higher-level reasoning.
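Per-model subtitle sensitivity can be read directly off the table. A minimal sketch, using three rows copied from the results above and measuring the change from orig to −sub; the dictionary layout and function names are illustrative. Which baseline and model set the averaged figures use is not stated here, so this computes per-model changes only.

```python
# Accuracy (%) copied from the results table, as (orig, +sub, -sub).
RESULTS = {
    "Qwen2.5-VL-72B-Instruct": {
        "SSU": (75.73, 76.24, 73.54),
        "SSR": (52.25, 52.75, 51.21),
        "SDP": (59.02, 58.40, 57.78),
    },
    "GPT-4o": {
        "SSU": (79.10, 79.74, 78.06),
        "SSR": (52.73, 53.20, 51.79),
        "SDP": (59.02, 60.59, 58.60),
    },
    "Gemini-2.5-Pro": {
        "SSU": (85.07, 85.41, 84.94),
        "SSR": (54.30, 54.85, 52.32),
        "SDP": (60.45, 61.54, 58.83),
    },
}


def subtitle_removal_delta(scores):
    """Change in accuracy (points) when going from orig to -sub."""
    orig, _plus_sub, minus_sub = scores
    return round(minus_sub - orig, 2)


def deltas_by_dimension(results):
    """Map each model to its per-dimension orig -> -sub change."""
    return {
        model: {dim: subtitle_removal_delta(s) for dim, s in dims.items()}
        for model, dims in results.items()
    }


if __name__ == "__main__":
    for model, deltas in deltas_by_dimension(RESULTS).items():
        print(model, deltas)
```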

SIV-Bench-Hard: Reasoning Quality vs. Humans

A focused subset isolates the most failure-prone questions and adds short rationales, so we can compare both answer accuracy and reasoning quality against a human baseline.

Model	Acc%	Relevance	Alignment	Coherence	Depth	Conciseness	Overall
Reference
Human (n=3)	74.40	—	—	—	—	—	—
Models
Gemini-3-Pro	45.50	4.66	3.30	4.67	3.49	4.87	4.10
GPT-5.1	39.00	4.58	3.29	4.65	3.26	4.88	4.00
Gemini-2.5-Pro	37.00	4.57	3.26	4.65	3.41	4.89	4.05
Gemini-2.5-Flash	32.32	4.48	3.17	4.55	3.22	4.87	3.95
GPT-4o-mini	29.00	4.45	3.20	4.56	3.12	4.91	3.90
Qwen2.5-VL-7B	24.50	4.00	2.89	4.21	3.05	4.45	3.63

The best model still trails the human baseline by about 29 points in answer accuracy (45.50% vs. 74.40%). The gap is largest on alignment (how well the rationale fits the social context), where no model scores above 3.30 and Qwen2.5-VL-7B falls to 2.89.
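The accuracy gap and each model's lowest-scoring rationale dimension can be recomputed from the table. A minimal sketch with values copied from the SIV-Bench-Hard table; the data layout and function names are illustrative assumptions.

```python
HUMAN_ACC = 74.40  # Human (n=3) accuracy from the table
DIMENSIONS = ("Relevance", "Alignment", "Coherence", "Depth", "Conciseness")

# model -> (accuracy %, rationale scores in DIMENSIONS order)
HARD_RESULTS = {
    "Gemini-3-Pro": (45.50, (4.66, 3.30, 4.67, 3.49, 4.87)),
    "GPT-5.1": (39.00, (4.58, 3.29, 4.65, 3.26, 4.88)),
    "Gemini-2.5-Pro": (37.00, (4.57, 3.26, 4.65, 3.41, 4.89)),
    "Gemini-2.5-Flash": (32.32, (4.48, 3.17, 4.55, 3.22, 4.87)),
    "GPT-4o-mini": (29.00, (4.45, 3.20, 4.56, 3.12, 4.91)),
    "Qwen2.5-VL-7B": (24.50, (4.00, 2.89, 4.21, 3.05, 4.45)),
}


def accuracy_gap(model):
    """Points by which a model trails the human baseline."""
    accuracy, _scores = HARD_RESULTS[model]
    return round(HUMAN_ACC - accuracy, 2)


def weakest_dimension(model):
    """Name of the rationale dimension with the lowest score."""
    _accuracy, scores = HARD_RESULTS[model]
    return DIMENSIONS[scores.index(min(scores))]


if __name__ == "__main__":
    for name in HARD_RESULTS:
        print(name, accuracy_gap(name), weakest_dimension(name))
```

For Gemini-3-Pro this gives a gap of 28.90 points, with alignment as its lowest-scoring dimension.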

Citation

@misc{kong2025sivbench,
  title  = {SIV-Bench: A Video Benchmark for Social Interaction Understanding and Reasoning},
  author = {Fanqi Kong and Weiqin Zu and Xinyu Chen and Yaodong Yang and Song-Chun Zhu and Xue Feng},
  year   = {2025},
  eprint = {2506.05425},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url    = {https://arxiv.org/abs/2506.05425}
}