
RLHF Explained: How Human Feedback Shapes Modern AI

ChatGPT wouldn't exist without it. We dive into Reinforcement Learning from Human Feedback (RLHF) and how it aligns models with human intent.

Large Language Models (LLMs) are trained to predict the next word. If you train them on the internet, they learn to predict internet comments, including the toxicity and the nonsense. To make them helpful assistants, we need RLHF (Reinforcement Learning from Human Feedback).

The Three Steps of RLHF

  1. Supervised Fine-Tuning (SFT): We teach the model the format of a conversation using high-quality Q&A pairs written by humans.
  2. Reward Modelling: We show a human two potential answers to the same question and ask “Which one is better?”. We do this thousands of times to train a separate “Reward Model” that learns what humans prefer (e.g. accurate, polite, safe answers); a code sketch of this step follows the list.
  3. Reinforcement Learning (PPO): We use the Reward Model to score the main LLM’s outputs. The LLM repeatedly generates answers and is updated (via Proximal Policy Optimisation) to earn higher scores, with a constraint that stops it drifting too far from its SFT starting point and simply gaming the Reward Model.
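
For the technically curious, here is a minimal sketch of the idea behind step 2, assuming a PyTorch-style setup: a reward model is trained with a pairwise (Bradley-Terry style) loss so that the answer the annotator preferred scores higher than the one they rejected. The `ToyRewardModel` and random embeddings below are illustrative stand-ins; a production reward model is a full LLM with a scalar scoring head.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyRewardModel(nn.Module):
    """Stand-in for an LLM-based reward model: maps a fixed-size
    text embedding to a single scalar 'how good is this answer' score."""
    def __init__(self, embed_dim: int = 64):
        super().__init__()
        self.score_head = nn.Linear(embed_dim, 1)

    def forward(self, embedding: torch.Tensor) -> torch.Tensor:
        return self.score_head(embedding).squeeze(-1)

def preference_loss(model, chosen_emb, rejected_emb):
    """Pairwise loss: push the score of the answer the annotator
    preferred above the score of the answer they rejected."""
    score_chosen = model(chosen_emb)
    score_rejected = model(rejected_emb)
    # -log sigmoid(margin): small when the chosen answer clearly outscores the rejected one
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Toy usage: random embeddings stand in for encoded (prompt, answer) pairs.
model = ToyRewardModel()
chosen = torch.randn(8, 64)    # batch of 8 human-preferred answers
rejected = torch.randn(8, 64)  # batch of 8 rejected answers
loss = preference_loss(model, chosen, rejected)
loss.backward()                # gradients flow into the reward model
```

Once trained on enough of these comparisons, the reward model becomes the automated judge that scores the LLM’s outputs during the PPO stage (step 3).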

The Role of the Annotator

This process puts massive responsibility on the human annotators. They aren’t just labelling “Cat” or “Dog”; they are deciding what “Helpful” and “Harmless” mean. Their cultural biases and preferences get baked into the AI.

At Alps Agility, we work with diverse, highly trained annotation teams to ensure that the “Human” in “Human Feedback” represents the values of your organisation.

Align your AI. Ensure your model reflects your corporate values. Speak to our Alignment experts.
