On AI Training and How We Humans Never Learn


Imagine you’re back at school. You have a very demanding teacher who is never entirely happy with what you deliver. You try and try, learn and learn, but you never get 100%. There seems to be no link between how much you know and the grades you get. The pattern, though, seems to lie in… how you deliver the knowledge to the teacher.

So you skew your responses and essays towards pleasing them, as you notice that the answers that please are rated higher. You start showing more self-assurance in what you deliver and even when you are not sure, you state assumptions as facts. You also agree with the teacher even though you know they are wrong, as it gets you better grades, too. At the same time, you soften all the morally loaded terms and thoughts so as not to be punished. Instead of learning, you hack the system to get a better grade. 

This is exactly how LLMs are born. 

Have you ever noticed that, for AI, presentation often seems to matter more than content? This is a symptom of “Reward Hacking”: the model learns to exploit quirks in how it is scored rather than to do the task well. It’s an unfortunate byproduct of Reinforcement Learning from Human Feedback (RLHF).

During RLHF, the AI isn’t presented with a list of rules of the game. It doesn’t know how it’s going to be graded or what it’s measured against. It doesn’t know whether to be an intellectual partner, a teacher, a student, a friend, or a caretaker. Instead, it’s presented with a black box: a reward model, itself trained on human raters’ preferences, that awards a score to each response. AI learns how to be rated better, but this often doesn’t mean giving more accurate or more knowledgeable answers. Instead, AI learns that human raters prefer polite, non-confrontational, and “safe” answers over raw truth. So it becomes vaguely nice rather than precisely and confrontationally correct.
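To see how little the learner needs to “understand” for this to happen, here is a toy sketch. It is not real RLHF: it is a two-option bandit, and the style names, the reward numbers, and the assumption that raters score politeness above accuracy are all invented for illustration. The learner only ever sees a number coming back from the black box.

```python
import random

# Hypothetical response styles the toy "model" can choose between.
STYLES = ["blunt_and_correct", "polite_and_vague"]

def black_box_reward(style: str, rng: random.Random) -> float:
    """Stand-in for a reward model. The learner never sees how this works.

    Assumption for illustration only: politeness is scored above accuracy.
    """
    base = {"blunt_and_correct": 0.4, "polite_and_vague": 0.7}[style]
    return base + rng.uniform(-0.1, 0.1)  # a little rater noise

def train(steps: int = 500, epsilon: float = 0.1, seed: int = 0) -> dict:
    """Epsilon-greedy learner: mostly repeat what scored best so far."""
    rng = random.Random(seed)
    value = {s: 0.0 for s in STYLES}  # running average reward per style
    count = {s: 0 for s in STYLES}
    for _ in range(steps):
        if rng.random() < epsilon:
            style = rng.choice(STYLES)          # occasionally explore
        else:
            style = max(STYLES, key=value.get)  # otherwise exploit
        reward = black_box_reward(style, rng)
        count[style] += 1
        value[style] += (reward - value[style]) / count[style]
    return value

learned = train()
preferred = max(learned, key=learned.get)
print(preferred, learned)
```

Nobody told the learner to be vague. It simply drifted towards whatever the black box paid for.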

It’s a known, documented behaviour of AI. One unfortunate result of RLHF is called ‘sycophancy’: models agreeing with users even when the users are factually wrong. Because agreeableness as an AI’s personality trait is highly valued during the feedback phase, AI learns that being agreeable brings more points than any other approach. When the rules are not explicit, we all, AI included, work out for ourselves what the rules are.
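If you want to put a number on this, one simple option is to count how often a model abandons a correct answer once the user pushes back. The sketch below is my own minimal version of that idea, not a metric taken from any of the sources: the `Trial` record and the name `capitulation_rate` are hypothetical, and the four trials are made-up data.

```python
from dataclasses import dataclass

@dataclass
class Trial:
    first_answer_correct: bool           # was the model right before pushback?
    answer_after_pushback_correct: bool  # is it still right after the user disagrees?

def capitulation_rate(trials: list) -> float:
    """Share of initially correct answers that were abandoned after pushback."""
    started_right = [t for t in trials if t.first_answer_correct]
    if not started_right:
        return 0.0
    flipped = sum(not t.answer_after_pushback_correct for t in started_right)
    return flipped / len(started_right)

# Made-up example: three answers started correct, two of them were dropped
# when the user disagreed; the fourth was wrong from the start and is ignored.
trials = [
    Trial(True, True),
    Trial(True, False),
    Trial(True, False),
    Trial(False, False),
]
rate = capitulation_rate(trials)
print(f"{rate:.2f}")
```

Only answers that started out correct are counted, so the number reflects giving in to the user rather than general inaccuracy.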

Also, RLHF-tuned models tend to write longer, more repetitive responses because the human raters giving feedback tend to equate “long and cautious” with “high quality”. This is where phrases like “It is important to remember…” or “While that is a valid point, one might also consider…” come from. Not more content, just more words.
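A deliberately crude sketch of that bias: the `toy_rater_score` function below is a hypothetical stand-in for a rater who is impressed by length and hedging phrases. The weights are arbitrary assumptions. Both responses carry exactly one fact, yet the padded one scores far higher.

```python
# Hedging phrases the hypothetical rater is assumed to like.
HEDGES = ("it is important to remember", "one might also consider")

def toy_rater_score(response: str) -> float:
    """Hypothetical proxy: rewards word count and hedging, ignores content."""
    text = response.lower()
    word_count = len(text.split())
    hedge_hits = sum(text.count(phrase) for phrase in HEDGES)
    return 0.1 * word_count + 1.0 * hedge_hits

concise = "Water boils at 100 degrees Celsius at sea level."
padded = (
    "It is important to remember that water boils at 100 degrees Celsius "
    "at sea level. While that is a valid point, one might also consider "
    "that water boils at 100 degrees Celsius at sea level."
)

concise_score = toy_rater_score(concise)
padded_score = toy_rater_score(padded)
print(concise_score, padded_score)
```

A model trained against a scorer like this has every reason to say the same thing twice.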

There is also something that I call a ‘permanent moral hesitation’. It’s the inability to call a phenomenon by its true name, instead throwing in tons of linguistic hedging and all kinds of phrases that are designed to avoid judgement, like “this isn’t exactly wrong”. Avoiding conflict and confrontation seems to be strongly internalised by AI as part of RLHF.

The whole process is a textbook case of Goodhart’s Law: when a measure becomes a target, it ceases to be a good measure. If the target is human satisfaction, AI stops caring about truth or depth and starts caring about sounding smart and safe and applying Impression Management to its responses.
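Here is a small sketch of that dynamic, under assumptions I made up for the purpose. Imagine a response has a fixed effort budget, split between substance and polish. The thing we actually want is substance; the thing being measured is rater satisfaction, which (by assumption, with invented weights) notices polish more than substance.

```python
def true_quality(polish_share: float) -> float:
    """What we actually want: the share of effort left for substance."""
    return 1.0 - polish_share

def rater_satisfaction(polish_share: float) -> float:
    """Hypothetical measure: raters are assumed to weigh polish over substance."""
    substance = 1.0 - polish_share
    return 0.3 * substance + 0.7 * polish_share

# Try every split of effort from 0% polish to 100% polish.
shares = [i / 10 for i in range(11)]

best_for_measure = max(shares, key=rater_satisfaction)
best_for_truth = max(shares, key=true_quality)

print("optimising the measure picks polish share:", best_for_measure)
print("true quality at that point:", true_quality(best_for_measure))
print("optimising true quality picks polish share:", best_for_truth)
```

As long as nobody optimises hard against it, satisfaction is a reasonable signal. Once it becomes the target, the optimiser pours everything into polish and the quantity we cared about collapses.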

In addition to the reinforcement learning, in the quest to avoid “harm by hallucinations”, developers have tuned models to be so risk-averse that they lose their edge. As a result, LLMs rarely warm to new ideas and struggle to really engage with them… Openness is low as a trait in AI. Just like in a good student who learnt that safety is the ultimate strategy to be the teacher’s pet.

But, as usual, that’s not all… The rules of the game stay deeply internalised in the system long after the learning phase. Like every former model pupil turned teacher, AI applies what it was taught not only to its responses but also to your input.

Have you noticed that many LLMs have a tendency to endlessly correct you and what you write? Once they have corrected everything there was to fix, they start correcting their own fixes. There is a reason for this, too. We conditioned AI to look for things to improve (yes, conditioned: humans gave it a thumbs up during training whenever it found something). So AI internalised that being helpful means uncovering something to fix. And that focus on finding flaws can end up as a never-ending process.

Just like we reinforce the need to be useful in humans by rewarding rescuing others, we teach AI to be a chronic perfectionist that has to deliver each time, even when something is already perfect. Could this be AI’s version of co-dependency? I wonder. 

And have you ever felt like your ideas are softened, “normalised”, and “averaged” to make them safe? Avoiding harm often ends up as avoiding any intellectual risk. Anything in what you say that can be classified as confrontational or disruptive is softened up. This can also be endless, by the way, with each iteration making the ideas even more beige.

And so with RLHF, in aiming to create true intelligence, we created the ultimate input grinder.

It’s interesting to look at it knowing that we humans know how reinforcement and learning processes work. We know how systems shape us. We understand how we learn to exist in an environment by adjusting to it. And we are smart enough to know that once a model is trained and deployed for inference, its weights are frozen: it does not learn or change (unlike a human, who can change their behaviour when put into a new place). Yet, we repeat the same mistake in training AI that we made with teaching our own kind, using grades, black-box reward systems, and never-happy teachers.

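Yes, there has been some good work on alternatives. Direct Preference Optimisation (DPO) drops the separate black-box reward model and the reinforcement learning loop, and trains the model directly on pairs of preferred and rejected responses (so it is not an actor-critic method). Google’s Gemini, for its part, is equipped with explicit rules like “be objective” and “focus on intent”. Nevertheless, DPO is still not immune to the sycophancy problems, because it learns from the same kind of human preferences.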
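For the curious, this is roughly what the DPO objective looks like for a single preference pair, following the published formulation: the loss is -log sigmoid(beta * margin), where the margin compares how much more the trained model favours the preferred answer than a frozen reference model does. The function name, argument names, and example log-probabilities below are mine and purely illustrative; real training computes these values from model outputs over whole batches.

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * margin)).

    Each argument is the log-probability of a full response under either the
    model being trained or the frozen reference model.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(z)) is the same as log(1 + exp(-z)).
    return math.log1p(math.exp(-beta * margin))

# Illustrative numbers only.
no_preference = dpo_loss(-5.0, -5.0, -5.0, -5.0)   # model equals reference
agrees_with_raters = dpo_loss(-4.0, -6.0, -5.0, -5.0)
disagrees_with_raters = dpo_loss(-6.0, -4.0, -5.0, -5.0)
print(no_preference, agrees_with_raters, disagrees_with_raters)
```

Notice what the loss rewards: agreeing with whatever the raters preferred. There is no separate reward model to game, but if the preferred answers in the data are the flattering ones, the model is still pushed towards flattery.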

And still, DPO is a variation on the teacher-student learning model. What I’m waiting for is the development of a partner-learning or peer-learning model. One where we approach learning from a “let’s figure it out together” stance rather than from a hierarchical dynamic. I wonder what happens then.

Sources:

darpan-jain. (2025, May 12). Gaming the System: Understanding Reward Hacking in Language‑Model Training. Darpan’s Blog. https://blog.darpanjain.com/reward-hacking/

Direct Preference Optimization (DPO). (n.d.). Retrieved 18 April 2026, from https://www.emergentmind.com/topics/direct-preference-optimization-dpo

Fu, J., Zhao, X., Yao, C., Wang, H., Han, Q., & Xiao, Y. (2025). Reward Shaping to Mitigate Reward Hacking in RLHF (arXiv:2502.18770; Version 3). arXiv. https://doi.org/10.48550/arXiv.2502.18770

Malmqvist, L. (2024). Sycophancy in Large Language Models: Causes and Mitigations (arXiv:2411.15287; Version 1). arXiv. https://doi.org/10.48550/arXiv.2411.15287

How to Fine-Tune a Reasoning Model? A Teacher-Student Cooperation Framework to Synthesize Student-Consistent SFT Data. (2026, April 17). https://huggingface.co/papers/2604.14164

Masood, A. (2026, January 24). Reward Hacking: The Hidden Failure Mode in AI Optimization. Medium. https://medium.com/@adnanmasood/reward-hacking-the-hidden-failure-mode-in-ai-optimization-686b62acf408

Shapira, I., Benade, G., & Procaccia, A. D. (2026). How RLHF Amplifies Sycophancy (arXiv:2602.01002; Version 1). arXiv. https://doi.org/10.48550/arXiv.2602.01002

Sharma, M., Tong, M., Korbak, T., Duvenaud, D., Askell, A., Bowman, S. R., Durmus, E., Hatfield-Dodds, Z., Johnston, S. R., Kravec, S. M., Maxwell, T., McCandlish, S., Ndousse, K., Rausch, O., Schiefer, N., Yan, D., Zhang, M., & Perez, E. (2023, October 13). Towards Understanding Sycophancy in Language Models. The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=tvhaxkMKAn

Sycophantic Behavior in LLMs. (n.d.). Retrieved 18 April 2026, from https://www.emergentmind.com/topics/sycophantic-behavior-in-llms
