ByteDance Proposes Dense Video Multimodal Large Model Sa2VA

Sa2VA, developed by ByteDance and Peking University, integrates SAM-2 and LLaVA to achieve advanced spatiotemporal video and image understanding, outperforming existing models in multiple tasks.

Meng Li
Feb 12, 2025

"AI Disruption" publication New Year 30% discount link.


Sa2VA Model Zoo - a ByteDance Collection

In a recent paper, researchers from ByteDance, Peking University, and other institutions introduced Sa2VA.

It is billed as the first video multimodal large model to unify SAM-2 and a LLaVA-like architecture: SAM-2 contributes fine-grained spatiotemporal segmentation, while the LLaVA-style component supplies open-ended language understanding, giving the combined model fine-grained spatiotemporal understanding of both images and videos.
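
This excerpt doesn't spell out how the two models are wired together. A common pattern for this kind of coupling (popularized by LISA-style models, and consistent with how Sa2VA is described elsewhere) is to have the LLaVA-like model emit a special [SEG] token and to project that token's hidden state into the prompt space of SAM-2's mask decoder, which then produces per-frame masks. The sketch below is a minimal, runnable illustration of that pattern only; TinyLLM, TinyMaskDecoder, the dimensions, and the token id are stand-ins, not Sa2VA's actual components.

```python
import torch
import torch.nn as nn

LLM_DIM, PROMPT_DIM, SEG_TOKEN_ID = 64, 32, 999

class TinyLLM(nn.Module):
    """Stand-in for a LLaVA-like MLLM: embeds tokens, returns hidden states."""
    def __init__(self, vocab: int = 1000):
        super().__init__()
        self.embed = nn.Embedding(vocab, LLM_DIM)
        self.block = nn.TransformerEncoderLayer(LLM_DIM, nhead=4, batch_first=True)

    def forward(self, input_ids):
        return self.block(self.embed(input_ids))        # (batch, seq, LLM_DIM)

class TinyMaskDecoder(nn.Module):
    """Stand-in for SAM-2's mask decoder: scores frame features against a prompt."""
    def forward(self, frame_feats, prompt):
        # frame_feats: (T, PROMPT_DIM, H, W); prompt: (PROMPT_DIM,)
        return torch.einsum("tchw,c->thw", frame_feats, prompt)  # (T, H, W) logits

llm, decoder = TinyLLM(), TinyMaskDecoder()
# Bridge between the two models: [SEG] hidden state -> mask-decoder prompt.
projector = nn.Linear(LLM_DIM, PROMPT_DIM)

input_ids = torch.tensor([[5, 17, 42, SEG_TOKEN_ID]])    # "...<answer> [SEG]"
hidden = llm(input_ids)                                  # (1, 4, LLM_DIM)
seg_hidden = hidden[input_ids == SEG_TOKEN_ID]           # (1, LLM_DIM)
prompt = projector(seg_hidden)[0]                        # (PROMPT_DIM,)

frame_feats = torch.randn(8, PROMPT_DIM, 16, 16)         # features for 8 frames
masks = decoder(frame_feats, prompt)                     # (8, 16, 16) mask logits
print(masks.shape)
```

The key property of this design is that language and segmentation share one differentiable path: a mask loss can flow back through the projector into the language model during joint training.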

Specifically, the researchers designed a unified instruction-tuning format (an instruction-tuning pipeline) that casts five different task types and over 20 datasets into a single format for joint training.
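
The paper's exact schema isn't reproduced in this excerpt, but a unified format of this kind typically reduces every task to one conversation structure, with a [SEG] token in the answer marking where mask supervision attaches. The samples below are a hedged illustration; all field names and file paths are invented.

```python
# Three heterogeneous tasks normalized to one illustrative schema.
samples = [
    {   # plain image VQA: text in, text out, no mask supervision
        "media": {"type": "image", "path": "coco/000001.jpg"},
        "conversation": [
            {"role": "user", "content": "<image>\nWhat is the cat doing?"},
            {"role": "assistant", "content": "The cat is sleeping on the sofa."},
        ],
        "masks": None,
    },
    {   # referring image segmentation: answer carries [SEG], paired with a mask
        "media": {"type": "image", "path": "refcoco/000117.jpg"},
        "conversation": [
            {"role": "user", "content": "<image>\nSegment the man in the red jacket."},
            {"role": "assistant", "content": "Sure, it is [SEG]."},
        ],
        "masks": ["refcoco/000117_mask.png"],
    },
    {   # referring video object segmentation: same schema, video media, per-frame masks
        "media": {"type": "video", "path": "mevis/clip_042/", "num_frames": 8},
        "conversation": [
            {"role": "user", "content": "<video>\nSegment the dog chasing the ball."},
            {"role": "assistant", "content": "Sure, it is [SEG]."},
        ],
        "masks": ["mevis/clip_042/masks/"],
    },
]
```

Because every dataset is normalized to one schema, a single dataloader and a single loss-routing rule (text loss always; mask loss only when "masks" is set) can cover all the task families jointly.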

The model achieves leading results on multiple image and video understanding benchmarks, including referring image segmentation and referring video object segmentation.
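
The released checkpoints are gathered in the Sa2VA Model Zoo collection linked above. As a rough sketch, loading one with Hugging Face transformers might look like the following; the repo id and the trust_remote_code requirement are assumptions to be checked against the model card.

```python
from transformers import AutoModel, AutoTokenizer

# "ByteDance/Sa2VA-4B" is an assumed repo id from the collection; verify on the Hub.
model = AutoModel.from_pretrained(
    "ByteDance/Sa2VA-4B",
    torch_dtype="auto",
    trust_remote_code=True,   # assumed: custom Sa2VA code shipped in the repo
)
tokenizer = AutoTokenizer.from_pretrained(
    "ByteDance/Sa2VA-4B", trust_remote_code=True
)
```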
