March 27, 2026

Human-in-the-loop, reinforcement learning from feedback, and building on Turing

Notes on RLHF-style loops, why human judgment still matters, and how my work on Turing fits into that picture—with links to read alongside.

Tags: rlhf, reinforcement-learning, human-feedback, turing

Humans in the loop are not a bug

A lot of “AI” headlines pretend the model is autonomous. In practice, human feedback (ranking outputs, correcting tone, catching hallucinations) is what makes systems usable. RLHF (reinforcement learning from human feedback) formalizes that pattern: the reward signal comes from people, or from reward models trained on their preferences, not only from a static loss on a dataset.

A readable on-ramp is Hugging Face’s Illustrating Reinforcement Learning from Human Feedback (RLHF), which walks through preference data, reward modeling, and policy tuning without requiring a PhD to get value from the diagrams.

For the research lineage, OpenAI’s Learning to summarize from human feedback is one of the papers that popularized the modern recipe. Anthropic’s writing on Constitutional AI explores a related idea, in which AI feedback guided by written principles supplements human oversight; that approach is still contested and still evolving.
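
To make the reward-modeling step concrete, here is a minimal sketch of the pairwise preference loss used in the summarization paper: the reward model is pushed to score the human-preferred output above the rejected one. The function names and the plain-float rewards are illustrative assumptions, not taken from any library; a real setup would compute this over batches of model outputs in an ML framework.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(reward_chosen - reward_rejected)).

    The loss is small when the reward model already ranks the
    human-preferred output higher, and large when it ranks it lower.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_preference_loss(pairs: list[tuple[float, float]]) -> float:
    """Average the pairwise loss over (chosen, rejected) reward pairs."""
    return sum(pairwise_preference_loss(c, r) for c, r in pairs) / len(pairs)


# Hypothetical rewards: the first pair agrees with the human label,
# the second pair disagrees with it.
agrees = pairwise_preference_loss(2.0, 0.0)
disagrees = pairwise_preference_loss(0.0, 2.0)
print(f"agrees={agrees:.3f} disagrees={disagrees:.3f}")
```

When the two rewards are equal, the loss is log 2, which is the "no opinion" baseline; training lowers it by widening the margin in the direction the human labelers chose.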

What I mean by reinforcement learning here

I am not claiming a neat lab setup on every task. In product work, “RL” often shows up as iterating from evals: ship, measure, label failures, change prompts or tools, repeat. DeepMind’s research on RL agents in complex environments is the research-heavy end of the spectrum; your dashboard and error taxonomy are the day-job version of the same instinct.
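
The "measure, label failures" part of that loop can be sketched in a few lines. This is only an illustration under assumed names: `EvalResult`, `summarize`, and the failure labels are hypothetical, and a real pipeline would read results from wherever your evals are stored.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalResult:
    case_id: str
    passed: bool
    # Human-assigned label for a failed case (hypothetical taxonomy).
    failure_label: Optional[str] = None


def summarize(results: list[EvalResult]) -> dict:
    """Compute the pass rate and a failure count per label, most common first."""
    total = len(results)
    passed = sum(1 for r in results if r.passed)
    failures = Counter(
        r.failure_label or "unlabeled" for r in results if not r.passed
    )
    return {
        "pass_rate": passed / total if total else 0.0,
        "failures": failures.most_common(),
    }


# Made-up eval run: one pass, three labeled failures.
run = [
    EvalResult("case-1", True),
    EvalResult("case-2", False, "hallucination"),
    EvalResult("case-3", False, "tone"),
    EvalResult("case-4", False, "hallucination"),
]
summary = summarize(run)
print(summary["pass_rate"], summary["failures"])
```

The most frequent label tells you where the next prompt or tool change should go; rerunning the same cases after the change closes the loop.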

Turing

I am currently working in the Turing ecosystem. Turing.com connects engineers with serious remote roles and, in my case, intersects with work where human judgment and model behavior have to align: evaluation, refinement, and the kind of feedback loops that make deployed AI less brittle.

If you are exploring similar work, their developer-focused pages are the canonical entry point. Read them alongside the general RLHF material above and the same theme shows up: models improve when human intent is explicit in the loop.

Links quick list

Hugging Face, Illustrating Reinforcement Learning from Human Feedback (RLHF): https://huggingface.co/blog/rlhf
OpenAI, Learning to summarize from human feedback: https://arxiv.org/abs/2009.01325
Anthropic, Constitutional AI: https://www.anthropic.com/news/constitutional-ai-harmlessness-from-ai-feedback
DeepMind research (RL / agents): https://deepmind.google/research/
Turing: https://www.turing.com/

Closing

If your work touches human feedback signals, orchestrated agents, or evaluation at scale, we are probably solving adjacent puzzles. There is more on the rest of my stack and projects at yabibal.site.
