AI Disruption

Qwen2-VL Released: Visual Agent with Advanced Reasoning and Decision-Making!

Alibaba Open-Sources Qwen2-VL: Understands 20+ Minute Videos, Rivals GPT-4o!

Meng Li
Aug 30, 2024

Alibaba has released Qwen2-VL, the latest vision-language model in the Qwen series, and open-sourced the Qwen2-VL-2B and Qwen2-VL-7B models. A 72B version will be available later.

Key Features:

  • State-of-the-Art Image Understanding: Qwen2-VL excels in image comprehension benchmarks like MathVista, DocVQA, RealWorldQA, and MTVQA, handling various resolutions and aspect ratios.

  • Understanding Long Videos: With streaming capabilities, Qwen2-VL can understand videos over 20 minutes long, enabling tasks like video-based Q&A, conversations, and content creation.

  • Device Control Agent: Qwen2-VL can integrate with devices like phones and robots, executing actions based on visual environments and text instructions, thanks to its advanced reasoning and decision-making abilities.

  • Multilingual Support: Beyond English and Chinese, Qwen2-VL recognizes text in images in most European languages, Japanese, Korean, Arabic, Vietnamese, and more.
