DeepSeek Unveils New Paper on Inference-Time Scaling, Is R2 Coming?
DeepSeek's new Self-Principled Critique Tuning (SPCT) boosts AI reward models. Is R2 coming? Read the arXiv paper now!
"AI Disruption" Publication 5500 Subscriptions 20% Discount Offer Link.
A brand-new learning method.
Could this be the prototype of DeepSeek R2? The latest paper DeepSeek submitted to arXiv this Friday is steadily gaining attention in the AI community.
Currently, reinforcement learning (RL) is widely applied to the post-training of large language models (LLMs).
Recent results on incentivizing LLM reasoning capabilities through RL suggest that appropriate learning methods can achieve effective inference-time scalability. A key challenge for RL, however, is obtaining accurate reward signals for LLMs across diverse domains, beyond verifiable problems or human-defined rules.
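To make that challenge concrete, here is a minimal sketch (hypothetical names, not from the DeepSeek paper) contrasting a verifiable domain, where a fixed rule can compute the reward, with an open-ended query, where no such rule exists and a learned reward model would have to judge quality instead:

```python
# Minimal sketch (hypothetical names, not from the paper): rule-based rewards
# work for verifiable tasks but break down on open-ended ones.

def verifiable_reward(model_answer: str, reference_answer: str) -> float:
    """Rule-based reward: trivial to compute when a ground truth exists."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

# Verifiable domain: a math problem with a known answer -- a fixed rule suffices.
print(verifiable_reward("42", "42"))  # 1.0

# Open-ended domain: no reference answer or checkable rule is available,
# so the reward signal must come from a learned reward model instead.
open_ended_query = "Summarize the trade-offs of RL post-training for LLMs."
candidate_response = "RL post-training improves alignment but needs reward signals..."
# reward = reward_model.score(open_ended_query, candidate_response)  # learned, not rule-based
```

This gap between rule-checkable and open-ended domains is exactly what generalist reward modeling aims to close.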