The Complete Guide to Post-Training LLMs: How SFT, RLHF, DPO, and GRPO Shape LLMs

8/4/2026


 
​Table of Contents

1. Introduction

2. What Is Post-Training? The Hidden Stage That Defines Model Quality
2.1 Post-Training vs. Fine-Tuning: A Critical Distinction
2.2 The Three-Stage Pipeline: SFT, Preference Alignment, and Reinforcement Learning
2.3 Why Post-Training Now Accounts for the Majority of Usable Model Capability

3. Supervised Fine-Tuning (SFT): Teaching Models to Follow Instructions
3.1 Full Fine-Tuning, LoRA, and QLoRA - Choosing Your Approach
3.2 Dataset Quality: The Accuracy-Diversity-Complexity Triad
3.3 The Dataset Composition Blueprint

4. Preference Alignment: Making Models Helpful, Harmless, and Honest
4.1 RLHF - The Original Breakthrough
4.2 DPO - Eliminating the Reward Model
4.3 RLAIF and Constitutional AI - Anthropic's Scalable Alternative

5. Reinforcement Learning: The Frontier of Reasoning Models
5.1 GRPO - DeepSeek's Paradigm Shift
5.2 DAPO and RLVR - Verifiable Rewards for Reasoning
5.3 How OpenAI, Anthropic, and Google DeepMind Approach RL Differently

6. The Post-Training Toolkit: Libraries, Infrastructure, and Compute
6.1 Unsloth vs. TRL - Beginner-Friendly vs. Research-Grade
6.2 Compute Requirements and Cost Considerations

7. Post-Training Careers: Roles, Salaries, and How to Break In
7.1 The Exploding Demand for Post-Training Specialists
7.2 Interview Questions You Should Expect

8. The Complete Post-Training Preparation Roadmap
8.1 Weeks 1-4: Foundations
8.2 Weeks 5-8: Implementation
8.3 Weeks 9-12: Advanced Techniques and Portfolio Building

9. Conclusion: Post-Training Is Where AI Capability Is Won
​
10. 1-1 AI Career Coaching

1. Introduction


Post-training is now where the majority of a large language model's usable capability is created. This is the central finding of this analysis, and it has profound implications for anyone building, deploying, or seeking a career in AI. The transformation from a raw base model into ChatGPT, Claude, or Gemini happens not during pre-training, but during post-training.
​

Yet despite its outsized importance, post-training remains one of the least understood stages of the LLM development pipeline. Most public discourse fixates on pre-training - the massive compute clusters, the trillions of tokens, the scaling laws. Post-training, by contrast, operates in relative obscurity, even though the techniques pioneered here - Supervised Fine-Tuning (SFT), Reinforcement Learning from Human Feedback (RLHF), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO) - are what separate a research artifact from a product that hundreds of millions of people use every day.

This guide provides a comprehensive, practitioner-oriented deep-dive into the full post-training pipeline. Whether you are an ML engineer looking to specialise, a researcher evaluating alignment techniques, or a career switcher preparing for interviews at frontier AI labs, this analysis covers the technical foundations, the strategic landscape, and the career implications of mastering post-training. As I explored in my AI Research Engineer interview guide and the AI Research Scientist interview guide, understanding these techniques at depth is increasingly non-negotiable for anyone targeting roles at OpenAI, Anthropic, or Google DeepMind.

2. What Is Post-Training? The Hidden Stage That Defines Model Quality


2.1 Post-Training vs. Fine-Tuning: A Critical Distinction

One of the most common sources of confusion in applied AI is the conflation of "post-training" with "fine-tuning." These are not synonyms. The distinction is structural, not semantic, and understanding it is essential for both technical practitioners and career strategists.

Post-training refers to the general-purpose alignment and instruction-tuning process that model providers like OpenAI, Anthropic, and Google DeepMind perform on base models to create the instruct or chat variants that ship as products. It typically involves datasets exceeding one million examples, spans multiple training stages (SFT, preference alignment, and increasingly reinforcement learning), and aims to produce a model that is broadly helpful, harmless, and honest across the full distribution of user queries.

Fine-tuning, by contrast, is a task-specific or domain-specific adaptation performed by downstream users or enterprises. It uses smaller datasets - typically 10,000 to one million examples - and optimises the model for a narrow use case: a legal document classifier, a medical coding assistant, a customer support chatbot for a specific product line. Fine-tuning takes an already post-trained model and sharpens it further.

The practical implication is clear: if you are building a product on top of GPT-4 or Claude, you are fine-tuning. If you are working at a frontier lab creating the next version of those models, you are doing post-training. Both require deep knowledge of the same underlying techniques - SFT, LoRA, preference optimisation - but the scale, the dataset curation challenges, and the evaluation frameworks differ substantially.

2.2 The Three-Stage Pipeline: SFT, Preference Alignment, and Reinforcement Learning

The modern post-training pipeline, as confirmed by publications from all three major frontier labs, follows a three-stage architecture:

Stage 1 - Supervised Fine-Tuning (SFT):
The base model is trained on high-quality instruction-response pairs to learn the format, tone, and structure of helpful dialogue. This is the stage that transforms an autocomplete engine into something that can follow instructions.


Stage 2 - Preference Alignment (DPO or RLHF):
The SFT model is further refined using human preference data - pairs of responses where one is judged better than the other. This stage teaches the model not just what to say, but which of several plausible responses is most helpful, accurate, and safe. The output of this stage is the "instruct model" - the product that most users interact with.


Stage 3 - Reinforcement Learning with Verifiable Rewards (GRPO, DAPO, RLVR):
This is the newest and most rapidly evolving stage, pioneered by DeepSeek's R1 model in early 2025. Here, the model is trained using reinforcement learning on tasks with objectively verifiable answers - mathematical proofs, code execution, logical reasoning chains. The output is a "thinking model" or "reasoning model" that exhibits extended chain-of-thought reasoning.


This three-stage pipeline represents a significant evolution from the two-stage process (SFT + RLHF) that defined the 2022-2024 era. The addition of the third stage - RL with verifiable rewards - is what has enabled the rapid improvement in reasoning capabilities that distinguishes models like DeepSeek-R1, OpenAI's o1 and o3, and Anthropic's Claude Opus 4 from their predecessors.

2.3 Why Post-Training Now Accounts for the Majority of Usable Model Capability
​
The data on this point is striking. Liquid AI's benchmarks on their LFM 2.5 model demonstrate that post-training alone can improve benchmark performance by 20-40% across standard evaluations - a magnitude of improvement that would require orders of magnitude more pre-training compute to achieve through scaling alone. Research from Meta's Llama team shows similar results: the gap between Llama 3.1 base and Llama 3.1 instruct on user-facing tasks is not incremental; it is transformational.
​

This is not a productivity boost; it is a structural shift in where value is created in the AI development pipeline. For engineers and researchers, the implication is that post-training expertise is no longer a specialisation - it is a core competency. For companies, it means that competitive advantage increasingly lies not in who can pre-train the biggest model, but in who can post-train the most capable one.

3. Supervised Fine-Tuning (SFT): Teaching Models to Follow Instructions


3.1 Full Fine-Tuning, LoRA, and QLoRA - Choosing Your Approach

Supervised Fine-Tuning is the foundation of the post-training pipeline, and the choice of technique here has significant implications for compute cost, model quality, and practical deployment. Three approaches dominate the landscape, each with distinct tradeoffs that practitioners need to understand in depth.

Full Fine-Tuning (FP16) updates every parameter in the model using 16-bit floating-point precision. This is the gold standard for quality - it allows the model to adapt its entire weight space to the new data distribution. However, the compute and memory requirements are substantial. Fine-tuning a 70B parameter model in FP16 requires multiple high-end GPUs (typically 4-8 A100 80GB or H100 GPUs), and the training process can take days even on modern hardware. Full fine-tuning is the default choice at frontier labs where compute is abundant and maximum quality is non-negotiable.

LoRA (Low-Rank Adaptation) represents a paradigm shift in parameter-efficient fine-tuning. Instead of updating all parameters, LoRA freezes the base model and injects small trainable matrices into each transformer layer, typically reducing the number of trainable parameters by 90-99%. Operating at 16-bit precision, LoRA achieves 85-95% of full fine-tuning quality at a fraction of the compute cost. A 70B model can be LoRA fine-tuned on a couple of 80GB A100s or H100s rather than a multi-node cluster. The research, originally published by Hu et al. at Microsoft in 2021, has since been validated at scale by teams at Meta, Google, and dozens of startups building production fine-tuning pipelines.
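To make the mechanics concrete, here is a minimal LoRA setup using Hugging Face's PEFT library. This is a sketch: the model name, rank, and target modules are illustrative assumptions, not recommendations from this guide.

```python
# A minimal LoRA setup with Hugging Face PEFT. The model name, rank, and
# target modules are illustrative assumptions, not recommendations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-3.1-8B"  # assumed example; any causal LM works

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# LoRA freezes the base weights and injects small trainable low-rank matrices
# into the attention projections of every transformer layer.
lora_config = LoraConfig(
    r=16,                    # rank of the low-rank update matrices
    lora_alpha=32,           # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```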

QLoRA (Quantized Low-Rank Adaptation) pushes efficiency further by quantizing the base model to 4-bit precision before applying LoRA adapters. Introduced by Dettmers et al. in 2023, QLoRA enables fine-tuning of a 70B-class model on a single 48GB GPU - a democratisation of access that has fuelled the open-source model explosion. The quality tradeoff is real but often acceptable: QLoRA typically achieves 80-90% of full fine-tuning quality, which is more than sufficient for many production applications.
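A corresponding QLoRA sketch, assuming the standard bitsandbytes 4-bit path: quantise the frozen base model, then attach the same kind of LoRA adapters. Again, the model name and settings are illustrative assumptions.

```python
# A minimal QLoRA sketch: quantise the frozen base model to 4-bit with
# bitsandbytes, then attach LoRA adapters. Model name and settings are
# illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4, as in the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute happens in 16-bit
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B",             # assumed example model
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)  # casts norms to fp32, prepares quantised model for training

model = get_peft_model(
    model,
    LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
               task_type="CAUSAL_LM"),
)
```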

The decision framework is straightforward. Use full fine-tuning when you have the compute and need maximum quality (frontier lab post-training). Use LoRA when you need a strong balance of quality and efficiency (enterprise fine-tuning, research prototyping). Use QLoRA when compute is constrained or you are iterating rapidly on dataset experiments (startups, individual researchers, academic labs).

3.2 Dataset Quality: The Accuracy-Diversity-Complexity Triad

The single most important insight from practitioners working on SFT at scale is that dataset quality dominates dataset quantity. A model fine-tuned on 10,000 meticulously curated examples will consistently outperform one fine-tuned on 100,000 noisy examples. This finding has been replicated across multiple studies, including the LIMA paper from Meta (2023), which demonstrated near-GPT-4 quality with just 1,000 carefully selected instruction-response pairs.

There are three pillars of dataset quality that every practitioner must optimise for:

1. Accuracy is the most obvious requirement but also the most treacherous. Every instruction-response pair must be factually correct and appropriately formatted. A single category of systematic errors - say, consistently hallucinated citations in academic-style responses - can propagate through the entire model's behaviour distribution. Quality assurance at scale requires a combination of automated verification (checking code examples execute correctly, validating mathematical derivations - see the sketch after this list) and human review (assessing response helpfulness, tone, and safety).

2. Diversity ensures the model develops broad capability rather than overfitting to a narrow distribution. A post-training dataset must span a wide range of instruction types (open-ended questions, step-by-step tasks, creative writing, code generation, multi-turn conversation), domains (science, law, medicine, casual conversation), and difficulty levels. The research indicates that even a small percentage of underrepresented instruction types can cause catastrophic forgetting in those domains during SFT.

3. Complexity is perhaps the most under-appreciated dimension. Training on simple, single-step instructions produces a model that struggles with multi-step reasoning, nuanced analysis, and compositional tasks. The most effective SFT datasets deliberately include complex, multi-turn interactions that require the model to maintain context, handle ambiguity, and synthesise information across multiple steps.
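As referenced under accuracy above, here is a toy sketch of one automated verification pass: execute each code-style example together with its test and discard any that fail. The record format (a "response" field plus an optional "test" field) is an assumption made purely for illustration.

```python
# A toy automated accuracy check for SFT data: execute the code in each
# candidate example and drop any that fail. The record format is an assumption.
import subprocess
import sys
import tempfile

def passes_execution_check(example: dict, timeout_s: int = 10) -> bool:
    """Run the example's code followed by its unit test; True only if it exits cleanly."""
    program = example["response"] + "\n" + example.get("test", "")
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout_s)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

# clean_data = [ex for ex in raw_data if passes_execution_check(ex)]
```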

3.3 The Dataset Composition Blueprint

The empirical distribution of a successful post-training SFT dataset, as revealed by analysis of the SmolLM2 dataset composition, follows a pattern that would be familiar to anyone who has built production ML datasets: Math (39.4%), Code (38.9%), Chat/Conversation (17.6%), and Instruction Following (4.1%).


The heavy weighting toward math and code is not accidental. These domains provide the clearest signal for training - there is an objectively correct answer, and the model can be evaluated against it. Chat and instruction following, while critical for user experience, carry noisier reward signals and benefit from smaller but higher-quality datasets. This composition reflects a broader truth about post-training: the easiest domains to train on are those with verifiable ground truth, and the hardest are those that require subjective judgement. Getting the balance right is as much art as science, and it represents one of the most closely guarded secrets at frontier labs.
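If you want to enforce a target mix like the one above during training, sampling-based interleaving is one simple approach. This sketch uses Hugging Face's `datasets` library; the file names are placeholders, and the probabilities mirror the SmolLM2-style split quoted above.

```python
# One way to enforce a target domain mix during SFT, using Hugging Face
# `datasets`. The file names are placeholders; the probabilities mirror the
# SmolLM2-style split quoted above.
from datasets import load_dataset, interleave_datasets

math_ds = load_dataset("json", data_files="math_sft.jsonl", split="train")
code_ds = load_dataset("json", data_files="code_sft.jsonl", split="train")
chat_ds = load_dataset("json", data_files="chat_sft.jsonl", split="train")
inst_ds = load_dataset("json", data_files="instruction_sft.jsonl", split="train")

mixed = interleave_datasets(
    [math_ds, code_ds, chat_ds, inst_ds],
    probabilities=[0.394, 0.389, 0.176, 0.041],  # Math, Code, Chat, Instruction following
    seed=42,
)
```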

4. Preference Alignment: Making Models Helpful, Harmless, and Honest


4.1 RLHF - The Original Breakthrough

Reinforcement Learning from Human Feedback (RLHF) is the technique that bridged the gap between "a model that can follow instructions" and "a model that users actually want to interact with." Pioneered by OpenAI and Anthropic between 2020 and 2022, RLHF was the critical innovation that enabled the launch of ChatGPT and transformed AI from a research curiosity into a consumer product used by hundreds of millions.

The RLHF pipeline involves three components: a supervised fine-tuned model (the policy), a reward model trained on human preference data, and a reinforcement learning algorithm (typically PPO - Proximal Policy Optimization) that optimises the policy to maximise the reward model's scores while staying close to the original SFT model's distribution. Human annotators compare pairs of model responses and select the better one, generating the preference data that trains the reward model.
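The reward model at the heart of this pipeline is typically trained with a Bradley-Terry style pairwise loss over those human comparisons. A minimal sketch follows; in practice the scores come from a transformer with a scalar value head, but plain tensors are enough to show the objective.

```python
# A minimal sketch of the reward-model objective used in RLHF: a Bradley-Terry
# style pairwise loss over (chosen, rejected) comparisons. In practice the
# scores come from a transformer with a scalar value head.
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_scores: torch.Tensor,
                      rejected_scores: torch.Tensor) -> torch.Tensor:
    """Each tensor holds per-example scalar scores of shape (batch,)."""
    # Push the preferred response to outscore the rejected one.
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Example batch of four comparisons
chosen = torch.tensor([1.2, 0.4, 2.0, -0.1])
rejected = torch.tensor([0.3, 0.5, 1.1, -0.9])
loss = reward_model_loss(chosen, rejected)
```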

The technique is powerful but expensive. Collecting high-quality human preference data costs between $1 and $5 per comparison, and a typical RLHF training run requires hundreds of thousands of comparisons. At scale, this translates to millions of dollars in annotation costs alone, before accounting for the compute required for the RL training loop. The reward model itself introduces a layer of complexity - it must be large enough to capture nuanced quality distinctions but efficient enough to serve as a real-time scoring function during RL training.

Despite these challenges, RLHF remains the backbone of post-training at most frontier labs. OpenAI's GPT-4 and GPT-5 both use hybrid RLHF approaches that combine human preference data with model-generated comparisons. Google DeepMind's Gemini models undergo extensive RLHF with PPO, maintaining the most traditional implementation of the original pipeline. The technique works, and its results are empirically validated at scale.

4.2 DPO - Eliminating the Reward Model

Direct Preference Optimization (DPO), introduced by Rafailov et al. at Stanford in 2023, represents a mathematical insight that has reshaped the alignment landscape: you do not need a separate reward model. DPO reformulates the RLHF objective as a simple classification loss that can be applied directly to the language model using the same preference data. Instead of training a reward model, running an RL loop, and carefully managing the KL-divergence constraint, DPO achieves equivalent alignment quality with a single supervised training step.
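The resulting objective is compact enough to sketch directly. Assuming you have the summed per-token log-probabilities of each chosen and rejected response under both the policy and the frozen reference (SFT) model, the loss looks roughly like this; batching and masking details are omitted.

```python
# A from-scratch sketch of the DPO objective from Rafailov et al. (2023).
# Inputs are summed log-probabilities of each response under the policy and
# the frozen reference (SFT) model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # The "implicit rewards" are log-probability ratios against the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # A plain classification loss: prefer the chosen response over the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```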

The practical advantages are substantial. DPO eliminates the most unstable component of the RLHF pipeline - the RL training loop with PPO, which is notoriously sensitive to hyperparameters and prone to reward hacking. It reduces compute requirements by approximately 50% compared to full RLHF, since there is no separate reward model to train or serve. And it simplifies the engineering infrastructure required, making preference alignment accessible to teams that lack the specialised RL engineering expertise that RLHF demands.

The research evidence for DPO's effectiveness is now extensive. The original Stanford paper demonstrated that DPO matches or exceeds RLHF quality on standard alignment benchmarks. Subsequent work from teams at Meta, Mistral, and the open-source community has confirmed these findings at scale. DPO has become the default alignment technique for open-source model development and is increasingly used alongside RLHF at frontier labs.

The central question for practitioners is not whether DPO works - the data suggests it clearly does - but when to choose it over RLHF. The emerging consensus is that DPO excels for standard instruction-following alignment but may underperform RLHF for the most complex safety-critical behaviours, where the nuance captured by a dedicated reward model provides additional value. Most frontier labs now use both: DPO for the initial alignment pass and targeted RLHF for safety-critical domains.

4.3 RLAIF and Constitutional AI - Anthropic's Scalable Alternative

Anthropic has pioneered a fundamentally different approach to preference alignment that replaces human annotators with AI feedback - a technique known as RLAIF (Reinforcement Learning from AI Feedback) and operationalised through their Constitutional AI framework.

The economics of this approach are transformative. While human feedback costs $1 to $5 per comparison, AI-generated feedback costs less than $0.01 per comparison - a cost reduction of two to three orders of magnitude. Anthropic's Constitutional AI framework defines a set of principles (the "constitution" - most recently updated to an 80-page document in 2025) that guide the AI's evaluation of responses. The model critiques its own outputs against these principles, generating synthetic preference data that is then used for DPO or RLHF training.
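A toy sketch of what AI-feedback labelling can look like in practice is shown below. The principles and prompt template are invented for illustration, and `call_judge_model` is a placeholder for whichever evaluator model you use; Anthropic's actual constitution and pipeline are far more elaborate.

```python
# A toy sketch of AI-feedback (RLAIF-style) preference labelling. The principles
# and prompt template are invented for illustration; `call_judge_model` stands in
# for whichever evaluator LLM you use.
CONSTITUTION = [
    "Prefer the response that is more helpful and directly answers the question.",
    "Prefer the response that avoids harmful, deceptive, or unsafe content.",
    "Prefer the response that is honest about uncertainty.",
]

JUDGE_TEMPLATE = """You are comparing two assistant responses against these principles:
{principles}

User prompt: {prompt}
Response A: {response_a}
Response B: {response_b}

Which response better satisfies the principles? Answer with exactly "A" or "B"."""

def ai_preference(prompt: str, response_a: str, response_b: str, call_judge_model) -> dict:
    """Return a synthetic (chosen, rejected) pair suitable for DPO or RLHF training."""
    verdict = call_judge_model(JUDGE_TEMPLATE.format(
        principles="\n".join(f"- {p}" for p in CONSTITUTION),
        prompt=prompt,
        response_a=response_a,
        response_b=response_b,
    )).strip()
    chosen, rejected = (response_a, response_b) if verdict == "A" else (response_b, response_a)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```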

The quality question is nuanced. Research from Anthropic published in 2023-2024 demonstrates that RLAIF achieves comparable quality to human RLHF for the majority of alignment dimensions, with particular strength in consistency - an AI evaluator applies the same standards uniformly, while human annotators exhibit significant inter-rater variability. Where RLAIF falls short is in capturing novel edge cases and culturally contextualised judgements that require lived human experience. Anthropic addresses this gap with a hybrid approach: RLAIF for the bulk of preference data generation, supplemented by targeted human annotation for safety-critical categories.
​
This approach has significant implications for the competitive landscape. It suggests that alignment quality will increasingly be determined not by who can afford the most human annotators, but by who can design the most effective constitutional principles and AI evaluation frameworks. As I discussed in my analysis of context engineering for production-grade AI systems, the quality of the system architecture - in this case, the constitution and evaluation pipeline - matters more than brute-force scaling of any single component.

5. Reinforcement Learning: The Frontier of Reasoning Models


5.1 GRPO - DeepSeek's Paradigm Shift

Group Relative Policy Optimization (GRPO), introduced by DeepSeek in their DeepSeekMath work and pushed into the global spotlight by their R1 paper in January 2025, is the most consequential innovation in post-training since the original RLHF breakthrough. GRPO eliminates both the reward model and the critic network - two of the most computationally expensive and unstable components of the traditional RL pipeline - and replaces them with a remarkably elegant mechanism: group-relative scoring.

The mechanism works as follows. For each prompt, the model generates a group of multiple responses (typically 8-16). These responses are scored against a verifiable reward function - for mathematical problems, whether the answer is correct; for coding tasks, whether the code passes test cases. Each response's advantage is computed relative to the group mean, and the policy is updated to increase the probability of above-average responses and decrease the probability of below-average ones. There is no learned reward model to overfit, no critic network to train, and no complex PPO-style clipping to manage.
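The group-relative scoring step is simple enough to write out. A minimal sketch, assuming a binary verifiable reward per sampled response:

```python
# A minimal sketch of GRPO's group-relative advantage computation: score a group
# of sampled responses with a verifiable reward, then normalise each reward
# against the group's mean and standard deviation. No reward model, no critic.
import torch

def grpo_advantages(group_rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """group_rewards: shape (group_size,), one verifiable reward per sampled response."""
    mean = group_rewards.mean()
    std = group_rewards.std()
    # Above-average responses get a positive advantage; below-average, negative.
    return (group_rewards - mean) / (std + eps)

# Example: 8 sampled answers to one maths problem, reward 1.0 if correct else 0.0
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0])
advantages = grpo_advantages(rewards)
# The policy update then raises the likelihood of the correct answers
# and lowers the likelihood of the incorrect ones.
```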

The results have been extraordinary. DeepSeek-R1, trained primarily with GRPO, achieved reasoning performance competitive with OpenAI's o1 model at a fraction of the training cost. Independent reproductions by the open-source community have confirmed that GRPO can induce chain-of-thought reasoning, self-correction, and multi-step problem-solving capabilities that were previously thought to require massive-scale RLHF pipelines. The technique has been rapidly adopted: within months of the R1 paper, GRPO implementations appeared in Hugging Face's TRL library, and multiple startups and academic labs reported successful replications.

The strategic implications are significant. GRPO dramatically lowers the compute barrier to training reasoning models, shifting the competitive advantage from compute access to dataset design and reward function engineering. This connects directly to a theme I explored in my analysis of Nvidia's AI moat - as algorithmic efficiency improves, the moat shifts from raw hardware to the quality of the training pipeline and the tacit knowledge of the team operating it.

5.2 DAPO and RLVR - Verifiable Rewards for Reasoning

GRPO opened the door, and a rapid succession of innovations has followed. DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) extends GRPO with decoupled clipping ranges and dynamic prompt sampling, addressing the training instabilities and entropy collapse that appear when optimising long chain-of-thought rollouts. Early results suggest DAPO outperforms standard GRPO on competition-level mathematical reasoning benchmarks at comparable compute.

RLVR (Reinforcement Learning with Verifiable Rewards) represents the broader paradigm that GRPO exemplifies: training language models using reinforcement learning where the reward signal comes from an objectively verifiable outcome rather than a learned reward model. The key insight is that for a surprisingly large class of valuable tasks - mathematics, formal logic, code generation, structured data extraction, constraint satisfaction - the correctness of the output can be programmatically verified. This eliminates the reward model entirely and provides a training signal that is both cheaper and more reliable than human preference data.
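Concretely, a verifiable reward is often just a small deterministic function of the model output. Two toy examples follow; the "Answer: ..." extraction convention and the schema check are assumptions made for illustration.

```python
# Two toy verifiable reward functions in the RLVR sense: the reward is computed
# programmatically from the output, not predicted by a learned reward model.
import json
import re

def math_reward(completion: str, ground_truth: str) -> float:
    """1.0 if the completion's final 'Answer: ...' line matches the reference, else 0.0."""
    match = re.search(r"Answer:\s*(.+)", completion)
    if match is None:
        return 0.0
    predicted = match.group(1).strip().rstrip(".")
    return 1.0 if predicted == ground_truth.strip() else 0.0

def structured_output_reward(completion: str, required_keys: set) -> float:
    """1.0 if the completion is valid JSON containing every required key, else 0.0."""
    try:
        obj = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0
    return 1.0 if isinstance(obj, dict) and required_keys.issubset(obj) else 0.0
```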

The research frontier is moving rapidly. Teams at OpenAI, Google DeepMind, and multiple academic labs are exploring RLVR for domains beyond pure reasoning - including tool use (did the agent achieve the goal?), code generation (does the program pass all tests?), and structured output (does the JSON conform to the schema?). The central question is how far verifiable rewards can be extended before they hit the boundary of tasks that require genuinely subjective evaluation.

5.3 How OpenAI, Anthropic, and Google DeepMind Approach RL Differently

Each frontier lab has developed a distinctive philosophy toward reinforcement learning in post-training, reflecting their broader organisational cultures and technical bets.

OpenAI has pursued the most aggressive RL scaling strategy. Their o1 and o3 reasoning models represent the state of the art in RL-trained language models, using a proprietary pipeline that reportedly combines RLHF, process reward models (which provide feedback at each reasoning step rather than just the final answer), and massive-scale RL training runs. GPT-5 employs a hybrid approach that integrates RLHF with model-generated preference data at unprecedented scale. OpenAI's bet is that RL will continue to yield returns as it scales, and they have invested accordingly in both the infrastructure and the human annotation workforce to support this.

Anthropic takes a characteristically different approach, emphasising AI feedback and constitutional constraints over brute-force RL scaling. Their Claude models are trained using Constitutional AI, which combines RLAIF with carefully engineered principles rather than raw human preference data. Anthropic's 2025-era constitution runs to approximately 80 pages and encodes nuanced safety and helpfulness criteria that guide the AI evaluation process. This approach trades some raw performance for greater consistency and controllability - a tradeoff that reflects Anthropic's mission-driven emphasis on safety.

Google DeepMind maintains the most research-oriented approach, publishing extensively on novel RL techniques and maintaining closer ties to the academic RL community. Their Gemini models use SFT followed by RLHF with PPO - the most traditional implementation of the original pipeline - but supplemented by cutting-edge research on reward model robustness, multi-objective optimisation, and process-based feedback. DeepMind's advantage is breadth of research capability and tight integration with Google's infrastructure; their constraint is the complexity of aligning research timelines with product deployment cycles.

Understanding these differences is not merely academic - it directly informs interview preparation. As I detailed in my Research Engineer interview guide and my Research Scientist interview guide, each lab's interview process reflects its technical philosophy. OpenAI will test your ability to implement and debug RL training loops at speed. Anthropic will probe your understanding of alignment tradeoffs and constitutional principles. DeepMind will expect you to discuss the theoretical foundations of RL algorithms and evaluate research directions with taste and rigour. For Research Scientist candidates in particular, the ability to propose novel post-training research directions - not just implement existing techniques - is the differentiator that separates a hire from a reject.

6. The Post-Training Toolkit: Libraries, Infrastructure, and Compute


6.1 Unsloth vs. TRL - Beginner-Friendly vs. Research-Grade

Two libraries dominate the post-training landscape, and choosing between them is one of the first practical decisions any practitioner must make.

Unsloth has emerged as the go-to library for practitioners who need to get fine-tuning working quickly and efficiently. It provides optimised implementations of SFT, LoRA, and QLoRA with automatic memory management, pre-configured training recipes, and 2-5x speedups over baseline Hugging Face Transformers training through custom CUDA kernels. Unsloth's documentation is deliberately beginner-friendly, and it supports the most popular model architectures (Llama, Mistral, Phi, Gemma) out of the box. For enterprise fine-tuning, rapid prototyping, and educational use, Unsloth is the correct starting point.
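A minimal Unsloth QLoRA setup, assuming its documented FastLanguageModel interface, looks roughly like the sketch below; the checkpoint name and hyperparameters are illustrative, and exact arguments may differ across Unsloth versions.

```python
# A minimal Unsloth QLoRA setup, assuming the documented FastLanguageModel
# interface; checkpoint name and hyperparameters are illustrative.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",  # pre-quantised 4-bit checkpoint
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters; only these small matrices are trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
# The returned model and tokenizer plug directly into TRL's SFTTrainer.
```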

TRL (Transformer Reinforcement Learning) is Hugging Face's research-grade library that provides implementations of the full post-training pipeline: SFT, DPO, PPO, GRPO, and more experimental techniques. TRL offers significantly more flexibility and configurability than Unsloth, at the cost of a steeper learning curve and more manual configuration. If you need to implement a novel reward function, experiment with GRPO variants, or reproduce a specific paper's training pipeline, TRL is the necessary tool.

The practical recommendation is to use both. Start with Unsloth for initial SFT and dataset experiments where iteration speed matters most. Move to TRL when you need DPO, GRPO, or custom RL training loops. For interview preparation, you should be fluent in both - Unsloth demonstrates practical engineering sense, while TRL demonstrates research depth.

​6.2 Compute Requirements and Cost Considerations
The compute landscape for post-training has evolved rapidly, and practitioners need updated mental models for what is achievable at each price point.

For SFT with QLoRA on a 7-8B parameter model, a single A100 40GB or H100 GPU suffices, with training completing in 2-6 hours for a typical dataset of 50,000-100,000 examples. Cloud cost: approximately $10-30 per training run on Lambda Labs or RunPod. For SFT with LoRA on a 70B model, you need at least 2 A100 80GB or H100 GPUs, with training taking 12-48 hours. Cloud cost: approximately $100-500 per run. Full fine-tuning of a 70B model requires 4-8 H100s and can take several days. Cloud cost: $1,000-5,000 per run.
​

DPO adds approximately 30-50% to the SFT compute cost, since it requires forward passes through two models (the policy and the reference model). GRPO is more expensive still - generating multiple responses per prompt at training time multiplies inference cost by the group size (8-16x), though the elimination of the reward model partially offsets this.
The takeaway for career-minded practitioners: you can build a compelling portfolio of post-training projects for under $500 in cloud compute, using QLoRA and open-source models. The barrier to entry has never been lower.

7. Post-Training Careers: Roles, Salaries, and How to Break In


7.1 The Exploding Demand for Post-Training Specialists

The demand for engineers and researchers with post-training expertise has accelerated faster than almost any other AI specialisation. According to the 2025 Dice Tech Salary Report, AI engineers earned an average of $206,000 in the United States, representing a 4.5% year-over-year increase. But these averages obscure the true premium for post-training specialists: roles specifically focused on RLHF, alignment, and model fine-tuning at frontier labs command compensation packages of $200,000 to $312,000 for individual contributors, with senior and staff-level positions exceeding $400,000 at OpenAI, Anthropic, and Google DeepMind.

The job titles vary across organisations - "Post-Training Engineer," "Alignment Researcher," "RLHF Scientist," "Fine-Tuning Engineer," "Model Behaviour Specialist" - but the core competency is consistent: deep fluency in SFT, preference optimisation, and increasingly, RL-based training techniques. A search across major job boards reveals a 3x increase in listings mentioning "post-training" or "RLHF" between January 2025 and March 2026, outpacing the growth of general ML engineering roles over the same period.


7.2 Interview Questions You Should Expect

Based on my experience coaching candidates through interviews at all major frontier labs, here are the post-training questions that appear most frequently:

Technical Depth Questions:
  • Explain the RLHF pipeline end-to-end. Where can it fail, and how would you debug each failure mode?
  • Compare DPO and PPO-based RLHF. When would you choose one over the other?
  • What is GRPO, and why did DeepSeek's approach achieve competitive results at lower cost?
  • How does LoRA work mathematically? What determines the choice of rank?
  • Describe the KL-divergence constraint in RLHF. Why is it necessary, and what happens without it?

System Design Questions:
  • Design a post-training pipeline for a 70B model that needs to be helpful, harmless, and capable of multi-step reasoning. What stages would you include, and in what order?
  • How would you build a scalable human annotation pipeline for RLHF preference data? What quality control mechanisms would you implement?
  • Design a reward function for a code generation model. How would you handle edge cases where the code is correct but inefficient?

Research Taste Questions:
  • What are the limitations of DPO compared to RLHF? Is the field converging on one approach?
  • How would you extend GRPO to tasks without verifiable rewards?
  • What is the role of Constitutional AI in alignment? What are its strengths and weaknesses compared to RLHF?

8. The Complete Post-Training Preparation Roadmap


8.1 Weeks 1-4: Foundations

The first four weeks should establish your theoretical and practical foundations. Begin with a thorough study of the SFT pipeline: read the original LoRA paper (Hu et al., 2021), the QLoRA paper (Dettmers et al., 2023), and Maxime Labonne's post-training primer. Implement SFT with QLoRA on a 7B model using Unsloth - choose an open dataset like OpenHermes or SlimOrca, and train a model that you can interact with and evaluate qualitatively.

Simultaneously, build your understanding of the preference alignment landscape. Read the original RLHF paper (Christiano et al., 2017), the InstructGPT paper (Ouyang et al., 2022), and the DPO paper (Rafailov et al., 2023). Understand the mathematical relationship between RLHF and DPO - they optimise the same objective under different formulations, and understanding this equivalence is frequently tested in interviews.

8.2 Weeks 5-8: Implementation
Shift from reading to building. Implement DPO training using TRL on a preference dataset (UltraFeedback is a strong starting point). Compare the results qualitatively and quantitatively against your SFT-only model. Document the differences in helpfulness, safety, and response quality - this comparison becomes a powerful portfolio artifact.
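As a starting point, a DPO run with TRL can be as compact as the sketch below. Treat it as a sketch rather than a copy-paste recipe: argument names have shifted across TRL versions, and the small model and binarised UltraFeedback dataset are illustrative choices.

```python
# A compact DPO run with TRL, assuming a recent release; argument names have
# shifted across TRL versions, so treat this as a sketch rather than a recipe.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"   # small model, purely for illustration
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Preference data with "prompt", "chosen", "rejected" fields
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

training_args = DPOConfig(
    output_dir="dpo-demo",
    beta=0.1,                          # strength of the implicit KL constraint
    per_device_train_batch_size=2,
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,                    # TRL builds the frozen reference copy itself
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,        # older TRL versions call this `tokenizer`
)
trainer.train()
```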

Then tackle the frontier: implement GRPO on a mathematical reasoning task. Use TRL's GRPO trainer with a simple verifiable reward function (mathematical correctness). This is harder than SFT or DPO - you will need to manage group generation, advantage computation, and careful learning rate scheduling. The experience of debugging a GRPO training run is invaluable preparation for both interviews and real-world post-training work.
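A rough shape for that GRPO experiment, assuming a recent TRL release, is sketched below. The dataset file, its "prompt"/"ground_truth" columns, and the "Answer: ..." convention are assumptions for illustration; extra dataset columns are forwarded to the reward function as keyword arguments.

```python
# A rough GRPO run with TRL's GRPOTrainer and a verifiable reward, assuming a
# recent TRL release. Dataset format and answer convention are assumptions.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

def correctness_reward(completions, ground_truth, **kwargs):
    """Verifiable reward: 1.0 if the completion's final answer matches the reference."""
    rewards = []
    for completion, answer in zip(completions, ground_truth):
        predicted = completion.split("Answer:")[-1].strip()
        rewards.append(1.0 if predicted == str(answer).strip() else 0.0)
    return rewards

train_dataset = load_dataset("json", data_files="math_prompts.jsonl", split="train")

training_args = GRPOConfig(
    output_dir="grpo-demo",
    num_generations=8,                  # group size: responses sampled per prompt
    per_device_train_batch_size=8,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct", # illustrative small model
    reward_funcs=correctness_reward,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()
```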

8.3 Weeks 9-12: Advanced Techniques and Portfolio Building
The final four weeks should focus on depth and differentiation. Choose one area to go deep: Constitutional AI and RLAIF (implement a simple constitution and evaluate its effect on model behaviour), process reward models (implement step-by-step evaluation for mathematical reasoning), or multi-objective alignment (train a model to balance helpfulness, safety, and honesty using a combination of DPO and targeted RLHF).

Build a portfolio that demonstrates both breadth and depth. A strong post-training portfolio includes: one SFT project demonstrating dataset curation and training hygiene, one DPO/RLHF project showing preference alignment, one GRPO/RLVR project demonstrating reasoning enhancement, and a write-up comparing approaches with quantitative evaluation. Host your models on Hugging Face and write detailed technical blog posts documenting your process - these artifacts signal exactly the kind of practitioner capability that hiring managers at frontier labs are seeking.

9. Conclusion: Post-Training Is Where AI Capability Is Won


The transformation from a base model to a product-grade AI system happens during post-training, and the techniques involved - SFT, DPO, RLHF, GRPO, Constitutional AI - represent one of the most dynamic and consequential areas of applied AI research.

The landscape is evolving rapidly. GRPO and verifiable reward approaches are expanding the frontier of what RL-trained models can achieve. DPO has democratised preference alignment. RLAIF is reshaping the economics of human feedback. And the emergence of a distinct post-training career track - with compensation premiums and dedicated roles at every major AI company - reflects the growing recognition that post-training is not a supporting function but a primary driver of model capability.

For practitioners, the path forward is clear: build foundational fluency across the full pipeline, develop depth in at least one frontier technique (GRPO, Constitutional AI, or process reward models), and create portfolio artifacts that demonstrate both theoretical understanding and practical implementation skill. The barrier to entry has never been lower - QLoRA and open-source models put production-grade post-training experiments within reach of anyone with a cloud GPU and the motivation to learn.
​
The central finding of this analysis bears repeating: the majority of what makes an AI model useful is created during post-training. Master these techniques, and you are not just learning a specialisation - you are positioning yourself at the exact point where AI capability is won.

10. 1-1 AI Career Coaching


The post-training landscape is moving faster than any individual can track alone. New techniques emerge monthly - GRPO was unknown eighteen months ago; today it is reshaping how every frontier lab trains reasoning models. For engineers and researchers navigating this space, the difference between a well-timed career move and a missed opportunity often comes down to having a strategic perspective that goes beyond technical knowledge.

Here is what you get in a coaching engagement for Research Scientist and Research Engineer roles:
  • Personalised assessment of your post-training readiness and skill gaps against specific target roles at frontier labs
  • Deep-dive preparation for RLHF, DPO, and GRPO interview questions tailored to each company's technical philosophy
  • Portfolio strategy to build post-training projects that demonstrate production-grade capability
  • End-to-end application strategy covering resume optimisation, networking at target companies, and timeline management

Post-training expertise is now central to both Research Engineer and Research Scientist roles at frontier labs. Explore my AI Research Scientist interview guide for a comprehensive breakdown of how to prepare for RS roles where post-training research is the core focus, my AI Research Engineer interview guide for the implementation-focused track, or my Company-specific guides to getting hired at OpenAI, Anthropic & DeepMind for detailed breakdowns of each lab's interview process and culture.

Book a free discovery call and share your current role, target companies, and timeline, and we will build a personalised plan for breaking into post-training at the world's top AI labs.