DeepSeek R1 Technology Successfully Migrates to the Multimodal Domain, Fully Open Sourced
Discover Visual-RFT—an open-source breakthrough that extends DeepSeek-R1’s rule-based reinforcement learning to vision-language models for efficient few-shot learning.
Today, we are excited to introduce a breakthrough open-source project for visual reinforcement fine-tuning — Visual-RFT (Visual Reinforcement Fine-Tuning).
Visual-RFT extends the rule-based reward reinforcement learning method behind DeepSeek-R1 and the paradigm of OpenAI's Reinforcement Fine-Tuning (RFT) from text-only large language models to large vision-language models (LVLMs).
By designing rule-based rewards tailored to visual tasks such as fine-grained classification and object detection, Visual-RFT breaks past the limitations of the DeepSeek-R1 method, which was previously confined to domains like text, mathematical reasoning, and code, and paves a new path for training vision-language models.
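As a rough sketch of what a rule-based reward for object detection might look like, the snippet below scores a predicted bounding box by its Intersection-over-Union (IoU) with the ground truth, plus a small bonus when the answer follows the expected reasoning format. The function names and the 0.1 format-bonus weighting are illustrative assumptions, not details taken from the Visual-RFT paper.

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two [x1, y1, x2, y2] boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def detection_reward(pred_box, gt_box, has_think_tags):
    """Hypothetical rule-based reward: an IoU term plus a small
    format bonus for wrapping reasoning in <think>...</think> tags."""
    format_bonus = 0.1 if has_think_tags else 0.0
    return iou(pred_box, gt_box) + format_bonus
```

Because such a reward is computed from a simple rule rather than a learned reward model, it needs no extra training data or reward-model training, which is what makes this style of reinforcement fine-tuning data-efficient.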
Figure 1 illustrates this with an image containing many Pokémon. When the Visual-RFT-trained model is asked which Pokémon can use the move "Thunderbolt," it works through a <think> reasoning process and accurately localizes the bounding box for Pikachu, showcasing the model's generalization capability.