OpenAI's Reinforcement Finetuning: RL + Science — A New God or Thanos?

Discover OpenAI's Reinforcement Finetuning (RFT), combining RLHF and expert data for breakthroughs in medical diagnosis, decision-making, and scientific challenges.

Meng Li
Dec 08, 2024

12 Days of OpenAI | Day 2: Reinforcement Fine-Tuning.

On December 6, 2024, at 11 a.m. California time, OpenAI released Reinforcement Finetuning (RFT), a new method for building expert models. This approach allows users to solve decision-making problems in specialized domains, such as medical diagnosis or rare-disease detection, by fine-tuning with as few as a few dozen to a few thousand training cases.

Related: OpenAI Series #2: Enhanced Fine-Tuning – Train Your Expert Model with Minimal Samples (Meng Li, December 7, 2024).

The training data is formatted similarly to common instruction-tuning datasets: each case consists of a prompt, multiple options, and a correct answer. At the same time, OpenAI launched a Reinforcement Finetuning research program, encouraging scholars and experts to upload unique datasets from their fields to test this fine-tuning method.
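
To make that data format concrete, here is a minimal sketch of what a single multiple-choice RFT training case might look like when serialized as JSONL. The field names (`prompt`, `options`, `correct_answer`) are illustrative assumptions, not OpenAI's published schema.

```python
# Illustrative RFT-style training case (field names are assumptions,
# not OpenAI's official schema).
import json

example_case = {
    "prompt": (
        "A patient presents with symptom X and family history Y. "
        "Which gene is the most likely cause?"
    ),
    "options": ["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
    "correct_answer": "GENE_B",
}

# Datasets of this kind are typically uploaded as JSONL: one case per line.
print(json.dumps(example_case))
```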

This method builds upon techniques already widely used in alignment, mathematics, and coding. Its foundation is Reinforcement Learning from Human Feedback (RLHF), which aligns large models with human-preference data. In RLHF, each training example consists of a question, two candidate answers, and a label indicating which answer the user preferred. This preference data is used to train a reward model. Once the reward model is established, a reinforcement learning algorithm such as PPO fine-tunes the model parameters so that it produces content more aligned with user preferences; DPO reaches a similar result by optimizing directly on the preference data, without a separate reward model.
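
As a rough illustration of the reward-model step, the sketch below trains a scalar scorer on preference pairs with a pairwise (Bradley-Terry) loss in PyTorch. It assumes each (question, answer) pair is already encoded as a small feature vector; in practice the reward model is a full language model with a scalar head, and the trained scorer would then drive a PPO-style policy update.

```python
# Minimal reward-model sketch for the RLHF step described above
# (illustrative assumption, not OpenAI's implementation).
import torch
import torch.nn as nn

torch.manual_seed(0)

DIM = 16       # toy feature dimension (assumption for illustration)
N_PAIRS = 64   # number of preference pairs

# Dummy feature vectors standing in for encoded (question, answer) pairs:
# one tensor for the answers users preferred, one for the answers they rejected.
preferred = torch.randn(N_PAIRS, DIM)
rejected = torch.randn(N_PAIRS, DIM)

# Reward model: maps a (question, answer) representation to a scalar score.
reward_model = nn.Sequential(nn.Linear(DIM, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

for step in range(200):
    r_pref = reward_model(preferred)  # scores for preferred answers
    r_rej = reward_model(rejected)    # scores for rejected answers
    # Pairwise (Bradley-Terry) loss: push preferred scores above rejected ones.
    loss = -torch.nn.functional.logsigmoid(r_pref - r_rej).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final pairwise loss: {loss.item():.4f}")
```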
