AI Disruption
DeepSeek R1 Technology Successfully Migrates to the Multimodal Domain, Fully Open Sourced

Discover Visual-RFT—an open-source breakthrough that extends DeepSeek-R1’s rule-based reinforcement learning to vision-language models for efficient few-shot learning.

Meng Li
Mar 04, 2025


Today, we are excited to introduce a breakthrough open-source project for visual reinforcement fine-tuning — Visual-RFT (Visual Reinforcement Fine-Tuning).

Visual-RFT extends the rule-based reward reinforcement learning approach behind DeepSeek-R1, and OpenAI’s Reinforcement Fine-Tuning (RFT) paradigm, from text-only large language models to large vision-language models (LVLMs).

By designing task-specific rule-based rewards for visual tasks such as fine-grained classification and object detection, Visual-RFT lifts the DeepSeek-R1 method beyond the few domains to which it was previously confined (text, mathematical reasoning, and code), opening a new path for training vision-language models.
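As a concrete illustration of what a rule-based (verifiable) reward for object detection might look like, here is a minimal sketch assuming an IoU-based formulation; the function names and the exact scoring scheme are illustrative, not the paper's actual implementation:

```python
# Hypothetical sketch of a rule-based detection reward in the spirit of
# Visual-RFT's verifiable rewards: no learned reward model, just a rule
# (IoU against ground truth) scoring each predicted bounding box.

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def detection_reward(predicted_boxes, ground_truth_boxes, iou_threshold=0.5):
    """Mean best-match IoU over predictions; boxes below the
    threshold contribute zero, so hallucinated boxes are penalized."""
    if not predicted_boxes:
        return 0.0
    scores = []
    for pred in predicted_boxes:
        best = max((iou(pred, gt) for gt in ground_truth_boxes), default=0.0)
        scores.append(best if best >= iou_threshold else 0.0)
    return sum(scores) / len(scores)
```

Because the reward is computed by a fixed rule rather than a trained reward model, it can supervise reinforcement fine-tuning even in few-shot settings, which is the core idea the post describes.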

Figure 1 illustrates this with an image containing many Pokémon. When the model trained with Visual-RFT's multimodal reinforcement fine-tuning is asked which Pokémon can use the move “Thunderbolt,” it accurately locates the bounding box corresponding to Pikachu through a <think> reasoning process, showcasing the model’s generalization capability.

This post is for paid subscribers

© 2025 Meng Li