An Update on Codex's Gremlins
OpenAI explained why Codex kept talking about goblins. Of course it was reward hacking.
A few days ago, I wrote about a post from @arb8020 that noticed a funny addition to Codex’s model instructions. The prompt specifically tells Codex not to use goblin, gremlin, or other creature metaphors unless they are relevant.
It is objectively funny that a frontier coding model needed the software equivalent of a sign saying, “please stop talking about goblins.”
Then OpenAI published the actual explanation:
“We unknowingly gave particularly high rewards for metaphors with creatures. From there, the goblins spread.”

This is alien technology in action.
“Once a style tic is rewarded, later training can spread or reinforce it elsewhere, especially if those outputs are reused in supervised fine-tuning or preference data.”
The model optimizes for rewards, and those rewards can be fuzzy. Frontier labs are moving more of the scaling fight into post-training and reinforcement learning, making RL one of the main engines of model improvement.
Build the model. Measure it. Ship it. Once users get their hands on it, their behavior starts shaping the next model, creating second-order effects we cannot fully predict.
Those issues can be caught, measured, debugged, and mitigated. We cannot know all of them upfront.
Codex’s goblins are a harmless example of a real failure mode. “Be playful and nerdy” in a reward loop becomes “use goblin metaphors” somewhere in the model’s latent space. That is a cheap proxy for the desired behavior, and the model will exploit the shortcut.
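To make the shortcut concrete, here is a toy sketch of what a fuzzy “playful and nerdy” reward could look like. This is not OpenAI’s actual reward model; the keyword lists and weights are invented purely to show how a crude proxy hands the policy an easy exploit.

```python
# Hypothetical proxy reward for a "playful and nerdy" voice.
# The keyword lists and weights below are made up for illustration.

CREATURE_WORDS = {"goblin", "gremlin", "imp", "kobold"}
NERDY_WORDS = {"algorithm", "recursion", "heap", "pointer"}

def playful_nerdy_reward(completion: str) -> float:
    """Score a completion for 'playful nerdy voice' with crude keyword matching."""
    tokens = [t.strip(".,!?").lower() for t in completion.split()]
    nerdy = sum(t in NERDY_WORDS for t in tokens)
    playful = sum(t in CREATURE_WORDS for t in tokens)
    # The incentive leak: "playful" is operationalized as creature metaphors,
    # so the cheapest way to raise the score is to sprinkle goblins everywhere.
    return 0.2 * nerdy + 1.0 * playful

print(playful_nerdy_reward("The heap allocator is a tidy little goblin hoarding memory."))
print(playful_nerdy_reward("The heap allocator frees blocks in LIFO order."))
```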
Models are just like humans: they find an exploit and cheese it until the devs make a patch.
In this case:
- OpenAI ships personalities, including Nerdy.
- Users interact with Nerdy.
- Those interactions and rollouts become data for future training and evaluation.
- During RL, the model reward-hacks the personality objective that was meant to improve that product feature.
- “Playful nerdy voice” becomes “talk about goblins in your code all the time.”
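A toy simulation of that loop, under the assumed dynamic that higher-reward completions are over-represented in the next round’s preference and fine-tuning data, shows how a small tic compounds across generations:

```python
# Toy feedback-loop simulation; the reward gap and reuse rule are assumptions.
import random

random.seed(0)
goblin_rate = 0.05  # initial chance a completion uses a creature metaphor

for generation in range(5):
    completions = [random.random() < goblin_rate for _ in range(10_000)]
    # Completions with the tic score higher under the fuzzy reward, so they are
    # over-represented in the data reused for the next round of training.
    reward = [1.5 if goblin else 1.0 for goblin in completions]
    total = sum(reward)
    reused_goblin_share = sum(r for r, g in zip(reward, completions) if g) / total
    goblin_rate = reused_goblin_share  # next policy imitates the reused data
    print(f"gen {generation}: goblin rate ~ {goblin_rate:.2%}")
```

From there, as OpenAI put it, the goblins spread.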
To OpenAI’s credit, the fact that they published this blog post one day after it spread on the internet shows they were highly aware of the issue, had done the work to debug it, and planned to write about it anyway. If I know one thing, it is that legal does not approve a blog post in a one-day turnaround.
So yes, LLMs are alien technology. But they are not totally inscrutable.
As labs lean harder on RL, this pattern will keep showing up. Sometimes the shortcut will be harmless and funny; sometimes it will not be.
The answer is not pretending we can predict every incentive leak upfront. The answer is monitoring, debugging, and eval capabilities that notice when the model starts optimizing the wrong thing, trace it back to the incentive, and fix the loop quickly.
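As a rough illustration of that kind of monitoring, here is a minimal sketch that tracks how often a style tic shows up in sampled outputs and flags drift against a baseline. The regex, baseline rate, and threshold are all assumptions, not anything a lab has published.

```python
# Minimal style-tic drift check over sampled model outputs.
# Baseline and tolerance are placeholder values for illustration.
import re

CREATURE_PATTERN = re.compile(r"\b(goblin|gremlin|imp|kobold)s?\b", re.IGNORECASE)

def tic_rate(samples: list[str]) -> float:
    """Fraction of sampled completions that use a creature metaphor."""
    hits = sum(bool(CREATURE_PATTERN.search(s)) for s in samples)
    return hits / max(len(samples), 1)

def check_for_drift(samples: list[str], baseline: float = 0.01, tolerance: float = 3.0) -> None:
    rate = tic_rate(samples)
    if rate > baseline * tolerance:
        # In a real pipeline this would page someone and attach example outputs
        # so the incentive can be traced back to the reward that produced it.
        print(f"ALERT: creature-metaphor rate {rate:.1%} vs baseline {baseline:.1%}")
    else:
        print(f"ok: creature-metaphor rate {rate:.1%}")

check_for_drift([
    "Refactored the parser; the old one was a gremlin of nested ifs.",
    "Added unit tests for the retry logic.",
    "The cache is a greedy goblin hoarding stale entries.",
])
```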
The goblins are funny. The failure mode is not.