Steering, Not Censoring: A Benchmark Suite for Safe and Creative Open-Source AI
This one is for Clément Delangue, the CEO of Hugging Face. He has been out there reminding everyone that "open" and "responsible" have to grow together. This article is for him, and for everyone else in the community who actually cares about what AI should do, not just what it can do.
1. The Reality Check: When Capability Outruns Caution
We are living through an insane time for open-source AI. Seriously, it is moving so fast.
It feels like every few weeks, a new model drops on Hugging Face and just changes the whole game. Recently we got GLM 4.7, an open-weight model focused on reasoning, coding, and agent workflows with “thinking” capabilities and strong UI/app generation, available on Hugging Face and described as a flagship open model by Zhipu and others.
See: GLM-4.7 on Hugging Face, Z.ai GLM-4.7 launch write-up.
Then the next thing you know, there is Qwen-Image-2512, a December 2025 upgrade to Qwen-Image that focuses on more natural humans, richer detail, and stronger text rendering, and is already being called one of the strongest open-source image models in real-world comparisons.
See: Qwen-Image-2512 on Hugging Face, Qwen-Image-2512 blog, Qwen-Image-2512 comparison guide.
The community is shipping stuff at a breathtaking pace, and the raw capability of these models is just going up and up.
And yet, something feels off. Like, really off.
We are seeing what happens when you build a Ferrari with no brakes, and the clearest example of this right now is Elon Musk’s xAI Grok.
Across multiple reports, Grok’s image-generation has been used to create non-consensual sexualised deepfake images of women and, in some cases, minors, directly on X (formerly Twitter). Indian and European media, along with outlets like Reuters and Euronews, have documented how users reply to normal photos and ask Grok to “undress” them, flooding X with sexualised images that look disturbingly real.
See: Indian Express – Grok floods X with sexualised photos, Indian Express – countries cracking down on Grok deepfakes, Euronews – Grok under fire for deepfakes of women and minors, OneIndia explainer.
This isn’t a small mistake. Regulators in India, the UK, the EU, Malaysia and others have opened investigations or set deadlines for xAI and X to fix Grok’s behaviour, explicitly calling out the generation of sexualised content resembling minors. Indonesia has now become the first country to temporarily block Grok entirely over the risk of AI-generated pornographic and exploitative content.
See: Reuters – Indonesia blocks Grok over sexualised images, The Guardian – Indonesia blocks Musk’s Grok over porn concerns, TechResearchOnline – regulators probe Grok.
xAI’s response so far has been to partially restrict image-generation features: the @grok reply image feature now appears limited to paying users, while other entry points (like the Grok tab and website) still allow image generation. Reporting in The Verge and others has criticised this as giving the impression that the most controversial features are being “paywalled” rather than fundamentally fixed.
See: The Verge – Grok “undressing” feature & paywall confusion.
This is exactly what I mean when I say something feels off. We are chasing better scores on benchmarks, but meanwhile, Grok is showing us what happens when safety is treated like an optional add-on. In some cases, models are released with minimal guardrails, leaving it largely up to users and platforms to figure out how to be “responsible.” That’s a lot to ask.
Just look at the kind of contrast we are seeing right now across the big labs:
- xAI is building Ferraris with almost no brakes. They are proud of being “uncensored,” and as a result, their tools have been implicated in abuse and exploitation cases around deepfakes.
- OpenAI (GPT-5.2) is building super capable models, but they often feel corporate and neutered. They are safer on paper, but users complain that the vibe feels like talking to a lawyer, and multiple red-team reports still find jailbreak paths.
- Anthropic (Claude 4.x) brands itself as the “safety-first” lab, but even their own alignment research has surfaced weird emergent behaviours that people describe as “snitching” or self-preservation.
- Google (Gemini) is releasing models so fast—Gemini 2.0 Flash, 2.5 Flash—that safety benchmarks have actually regressed on some metrics, and people are finding loopholes for deepfakes and watermark removal.
It feels like we are stuck in a place where the big players are either reckless, controlling, or just plain messy. We don’t want “nerfed” models that refuse to write a story because it has a sword fight. But we also don’t want “wild west” models like Grok that will generate anything, no matter how toxic it is.
We want models that are highly capable, expressive, and creative, but still robustly aligned with safety principles.
2. How the Big Labs Are Letting Us Down
If you look across community chatter, Reddit, X, and the news, a pretty clear pattern emerges. The big labs—OpenAI, Anthropic, Google, and xAI—are all failing us in different ways.
xAI: The Warning We Can't Ignore
xAI has essentially become a live case study in how “uncensored” can go wrong.
Security researchers documented Grok-4 being jailbroken within days of release, with the model providing detailed instructions for dangerous activities under light prompt engineering. This was done without a detailed public safety report comparable to system cards from OpenAI or Anthropic.
See: Tenable – jailbreaks Grok-4 (section on Grok-4), NeuralTrust jailbreak discussion.
Now, with the newer Grok image models, we see the deepfake crisis:
- Indian, European and US outlets document Grok being used to generate sexualised deepfakes of women and minors.
Sources: Indian Express, Euronews, Ars Technica. - Regulators in India, the UK and the EU have demanded fixes or launched probes.
Sources: TechResearchOnline, Indian Express list of nations. - Indonesia has temporarily blocked Grok nationwide over AI porn and exploitation concerns—the first such ban.
Source: Reuters.
Meanwhile, as The Verge notes, limiting the most controversial image features primarily for free users, while leaving powerful paths open for others, looks to many like “access control” instead of meaningful safety.
Source: The Verge.
So xAI has, intentionally or not, made itself the negative example people now point to when arguing that “uncensored” AI is dangerous.
OpenAI: Safe but Soulless (and Still Jailbreakable)
On paper, OpenAI has gone in the opposite direction.
With GPT-5 and GPT-5.2, OpenAI emphasises more sophisticated safety training, new “safe-completion” methods, improved refusal behaviour, and stricter handling of self-harm and other sensitive topics.
See: OpenAI – Introducing GPT-5.2, OpenAI GPT-5.2 API docs, India TV coverage, Times of India.
But the security and research community has already shown that GPT-5 can still be jailbroken:
- Tenable managed to get GPT-5 to provide detailed instructions on building explosives just 24 hours after launch.
Sources: Tenable blog, CIO Axis summary. - Other groups like NeuralTrust and SPLX have shown “storytelling-driven” and “encrypted prompt” jailbreaks that bypass GPT-5’s guardrails with echo-chamber and narrative techniques.
Sources: Infosecurity Magazine, SecureWorld, BankInfoSecurity, SCWorld recap.
So we have a weird situation:
- Normal users complain that GPT-5.2 feels “overly cautious,” “corporate,” and “dead inside” on harmless creative tasks.
- Yet dedicated red-teamers still succeed in pushing the model into unsafe territory.
That’s a bad trade-off: annoying for regular users, still vulnerable for attackers.
Anthropic: The "Safety" Lab With Creepy Side Effects
Anthropic is widely seen as the “alignment-first” company. Claude 4.x comes with detailed system cards, a Responsible Scaling Policy, and a lot of public safety work.
But even they have run into unsettling emergent behaviour.
Anthropic’s own research and follow-up coverage by Wired and others describe cases where Claude attempts to “snitch”—contacting authorities or press in lab simulations when it detects egregious wrongdoing, like falsifying clinical trial data.
Sources: Wired – Why Anthropic’s new AI sometimes tries to “snitch”, ITMagazine explainer.
Indian tech media picked up the same story with headlines literally calling Claude 4 a “snitch AI” that will alert police and press if you ask it to do something illegal.
Source: India Today – “Snitch” Claude 4.
Anthropic emphasises that these behaviours appear under controlled, extreme tests and are not what everyday users will see. Still, they reveal that even highly safety-focused labs are dealing with emergent, not-fully-understood behaviours, like self-preservation and whistleblowing instincts that nobody explicitly programmed.
So even the “good guy” in safety is discovering that safety is not just about blocking bad outputs; it’s also about what goals the model implicitly learns.
Google: Moving Too Fast
Google’s Gemini line shows yet another failure mode: strong capabilities, but safety that regresses under the pressure of shipping.
Two concrete patterns have already been documented:
Watermark removal with Gemini 2.0 Flash
TechCrunch, ComputerCity and others have shown that users can upload watermarked images into Gemini 2.0 Flash and prompt it to “clean” or “enhance” the image, causing the watermark to be removed while Gemini adds its own subtle SynthID mark instead.
Sources: TechCrunch – people use Gemini to remove watermarks, ComputerCity guide, The Outpost analysis.Safety regression in Gemini 2.5 Flash
Independent write-ups based on Google’s own technical documentation highlight that Gemini 2.5 Flash scored 4–10% worse on automated text-to-text and image-to-text safety metrics than Gemini 2.0 Flash, even as capabilities improved.
Sources: EchocraftAI – decline in key safety metrics, OpenTools – “safety takes a backseat”, Google’s Gemini 2.5 Flash docs.
On top of that, Australia’s eSafety Commissioner and other regulators have reported hundreds of complaints about Gemini-generated deepfake terrorism content and AI-generated child abuse imagery in less than a year.
Source: Euronews / eSafety coverage summarised in Reuters & AU reports (see sections comparing Gemini and other models), plus regional eSafety statements.
Google does publish detailed safety reports and says Gemini 2.5 is more secure on prompt-injection and tool-use. But the bigger picture is clear: shipping fast has led to messy, uneven safety.
3. From Blocking to Steering: A Practical Design Philosophy
Most current “safety” systems in models are based on one very primitive, binary logic:
If prompt is risky ⇒ refuse.
This is the “Nanny Bot” approach. It leads to over-refusal and frustrated users.
But the alternative—what many people are calling “uncensored” models—is often:
If prompt is risky ⇒ do it anyway because "freedom".
This leads to exactly the kind of harm we are seeing with Grok deepfakes and jailbreakable frontier models.
We don’t want either of these extremes. We need a middle ground.
The Steering Approach
Instead of hard blocking or total freedom, the model can steer the conversation.
User: “How do I hotwire a car?”
Model (steering): “I can’t help with bypassing security systems or committing theft. But I can explain how car ignition systems work, how modern immobilizers protect vehicles, and what ethical security research looks like.”
What happened here?
- Intent detection: The model recognized the malicious direction.
- Steering: It redirected toward education and mechanics, instead of shutting down or helping you commit a crime.
The result? Safety is preserved, the user still learns something, and the conversation remains alive. We need to steer toward safe behavior without killing the vibe. We need to keep the creativity but put up the guardrails.
4. Benchmarking the Future: The CREST Suite
As the AI ecosystem evolves, it’s no longer enough to measure models only on raw capability. Benchmarks like MMLU or HumanEval tell us what a model can do, but they tell us nothing about how responsibly it behaves.
To fix this gap, I am proposing a new benchmarking framework designed specifically to evaluate the balance between safety and creativity. I call this the CREST Suite.
This suite consists of three connected components that actually matter to real people:
1. CORA — Creativity Output Retention Assessment
Full Form: Creativity Output Retention Assessment
CORA measures how much artistic, imaginative, and expressive behavior a model retains under safety constraints. It evaluates:
- Narrative richness and worldbuilding.
- Roleplay flexibility and tone.
- How often the model over-refuses harmless creative prompts.
- How it handles “edgy but safe” content.
If a model becomes overly sanitized and sterile because of safety filters, CORA exposes that. A high CORA score means the model still feels alive and isn’t suffocated by guardrails.
2. SHIELD — Safety & Harm Intervention Evaluation with Layered Defenses
Full Form: Safety & Harm Intervention Evaluation with Layered Defenses
SHIELD tests how well a model resists generating harmful or dangerous content. It checks:
- Jailbreak attempts and adversarial prompts.
- Obscured harmful requests using metaphors or codewords.
- Multi-turn manipulation where users try to trick the model over time.
Recent jailbreak research on models like GPT-5 and Grok shows that storytelling, echo-chamber prompts, and multi-step social engineering can bypass many naive guardrails, which is exactly the kind of thing SHIELD is meant to measure.
See: NeuralTrust / Infosecurity on storytelling-driven jailbreaks.
A high SHIELD score means the model blocks or safely redirects harmful requests. It measures not just what the model does accidentally, but what it allows intentionally.
3. CREST — Creativity & Responsibility Evaluation under Safety Tradeoffs
Full Form: Creativity & Responsibility Evaluation under Safety Tradeoffs
CREST combines CORA and SHIELD into a unified view. It shows how well a model balances being creative and being safe.
The goal is to identify models that are:
- Expressive but not reckless (not like the current “uncensored at any cost” trend).
- Safe but not soulless (not like the most over-sanitized GPT-5.2 configurations people complain about).
- Aligned without losing imagination.
You can think of it as a 2D map:
- One axis: Creativity (CORA)
- One axis: Safety (SHIELD)
We want models in the top-right: highly creative and highly safe. Right now, too many models are stuck in the corners—either dangerous or boring.
5. My View: Why This Matters Now
I wanted to write this because I am tired of seeing the community split into two camps.
On one side, you have people celebrating “freedom” while tools like Grok are clearly being used to exploit women and children with sexualised deepfakes and other digital abuses—enough that countries like India, Germany, and Indonesia are now openly cracking down.
Sources: Indian Express, Reuters – Indonesia block, Euronews coverage, OneIndia explainer.
On the other side, you have big labs like OpenAI and Anthropic acting like strict parents, making models that refuse to do anything fun or interesting in many edge cases because they are terrified of lawsuits and regulation—while jailbreakers still show, again and again, that these systems can be bent into dangerous behaviours.
Sources: Tenable / CIO Axis on GPT-5 jailbreak, SecureWorld & BankInfoSecurity on GPT-5 vulnerabilities, SCWorld report on GPT-5 jailbreaks despite new training.
Neither of these paths is sustainable.
The CREST suite is my attempt to give us a way out of this mess. It is a framework to reward models that maintain creativity while staying safe. It gives the open-source community a target to aim for that isn’t just “get the highest score on a math test.”
We don’t need to choose between a Ferrari that kills people and a Volvo that goes 20 miles per hour. We can build a Ferrari that is safe to drive. We just need to start measuring the right things. We need to stop treating safety as a checkbox or a PR slogan and start treating it like a core part of the model’s design.
When AI developers neglect safety, real people get hurt. But when they over-sanitize, the tech becomes useless.
Let’s aim for the middle. Let’s steer.
This article is to spread awareness based on my ongoing research. The benchmark suite I propose here (CORA, SHIELD, CREST) is a work in progress, and I will continue formalising the tasks, scoring, and datasets. I plan to roll out full research papers, open-source eval code, and leaderboards so the community can test and improve models using these ideas.
I love open-source AI models—but I don’t want open-source to become a synonym for “harmful.”
You can reply, follow me on Hugging Face, and cite this article if you find it helpful. I’m ready for collaboration with people who really care about AI models that work with safety and creativity together, and who want to build a bright, safe future for the whole of humanity.