ByteDance Proposes Sa2VA, a Dense Video Multimodal Large Model
Sa2VA, developed by ByteDance and Peking University, integrates SAM-2 and LLaVA to achieve advanced spatiotemporal video and image understanding, outperforming existing models in multiple tasks.
"AI Disruption" publication New Year 30% discount link.
In a recent paper, researchers from ByteDance, Peking University, and other institutions introduced Sa2VA.
It is presented as the first video multimodal large model to unify SAM-2's segmentation capabilities with LLaVA-like vision-language understanding, enabling fine-grained spatiotemporal comprehension of both images and videos.
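The article does not spell out how the two components are wired together, but a common design in segmentation-capable multimodal models is to have the language model emit a special segmentation token whose hidden state is projected into a prompt for the mask decoder. The sketch below illustrates that pattern only; all class names, attributes, and arguments are placeholders, not the authors' actual API.

```python
import torch
import torch.nn as nn

class Sa2VALikeModel(nn.Module):
    """Illustrative wiring of an LLaVA-style LLM with a SAM-2-style mask decoder.

    Hypothetical sketch: module names and call signatures are assumptions,
    not taken from the Sa2VA paper or codebase.
    """

    def __init__(self, llm, vision_encoder, llm_hidden_size, sam2_prompt_dim=256):
        super().__init__()
        self.vision_encoder = vision_encoder  # shared encoder for images / video frames
        self.llm = llm                        # LLaVA-like language model
        # Projects the hidden state of a special "[SEG]" token into a prompt
        # embedding that a SAM-2-style mask decoder can consume.
        self.seg_projection = nn.Linear(llm_hidden_size, sam2_prompt_dim)

    def forward(self, frames, input_ids, seg_token_id, sam2_decoder):
        # 1) Encode the frames and let the LLM attend to visual tokens plus text.
        visual_tokens = self.vision_encoder(frames)
        outputs = self.llm(input_ids=input_ids,
                           visual_embeds=visual_tokens,
                           output_hidden_states=True)

        # 2) Locate the hidden states at positions of the "[SEG]" token.
        hidden = outputs.hidden_states[-1]            # (batch, seq_len, hidden)
        seg_hidden = hidden[input_ids == seg_token_id]

        # 3) Use them as prompts for the SAM-2-style decoder, which produces
        #    masks that can then be tracked across the video.
        prompt = self.seg_projection(seg_hidden)
        masks = sam2_decoder(image_embeddings=visual_tokens,
                             prompt_embeddings=prompt)
        return outputs.logits, masks
```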
On the training side, the researchers designed a unified instruction tuning format (a single instruction tuning pipeline) that casts five different tasks into one format and combines more than 20 datasets for joint training.
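To make "unified instruction tuning format" concrete, the hypothetical samples below show how heterogeneous tasks (image segmentation, video segmentation, plain VQA) could all be expressed as the same chat-style record; the field names, prompts, and file names are invented for illustration and do not come from the paper.

```python
# Hypothetical unified instruction format: every task becomes a conversation,
# with an optional mask target for segmentation-style samples.
unified_samples = [
    {   # image referring expression segmentation
        "media": "image_0001.jpg",
        "conversation": [
            {"role": "user", "content": "<image>\nSegment the dog on the left."},
            {"role": "assistant", "content": "Sure, it is [SEG]."},
        ],
        "mask": "dog_left_mask.png",
    },
    {   # video referring expression segmentation
        "media": "clip_0042.mp4",
        "conversation": [
            {"role": "user", "content": "<video>\nSegment the person riding the bike."},
            {"role": "assistant", "content": "Sure, it is [SEG]."},
        ],
        "mask": "rider_masklets.npz",
    },
    {   # plain visual question answering, no mask target
        "media": "image_0007.jpg",
        "conversation": [
            {"role": "user", "content": "<image>\nWhat is the weather like?"},
            {"role": "assistant", "content": "It is a sunny day."},
        ],
        "mask": None,
    },
]
```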
The model achieves leading results on multiple video and image understanding tasks, including referring expression segmentation for both videos and images.