Making o1, o3, and Sonnet 3.7 hallucinate for everyone

by hahahacorn on 3/1/25, 6:24 PM with 219 comments
by andix on 3/1/25, 7:43 PM

I've gotten a lot of hallucinations like that from LLMs. I really don't get how so many people can get LLMs to code most of their tasks without these issues constantly popping up.

by dominicq on 3/1/25, 7:14 PM

ChatGPT used to assure me that you can use JS dot notation to access elements in a Python dict. It also invented Redocly CLI flags that don't exist. Claude sometimes invents OpenAPI specification rules. Any time I ask about anything remotely niche, LLMs tend to do badly.
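
(For anyone who hasn't hit that particular one: a plain Python dict simply doesn't support JS-style dot access, so the hallucinated version fails the moment it runs. A tiny sketch, with made-up example data:)

    # Bracket access and .get() are the real dict APIs;
    # dot notation raises AttributeError on a plain dict.
    user = {"name": "Ada", "role": "admin"}  # illustrative data only

    print(user["name"])      # works: Ada
    print(user.get("role"))  # works: admin
    print(user.name)         # AttributeError: 'dict' object has no attribute 'name'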

by latexr on 3/1/25, 8:00 PM

> Conclusion

> LLMs are really smart most of the time.

No, the conclusion is they’re never ā€œsmartā€. All they do is regurgitate text which resembles a continuation of what came before, and sometimes—but with zero guarantees—that text aligns with reality.

by simonw on 3/2/25, 6:26 AM

Every time this topic comes up I post a similar comment about how hallucinations in code really don't matter because they reveal themselves the second you try to run that code.

I've just written up a longer form of that comment: "Hallucinations in code are the least dangerous form of LLM mistakes" - https://simonwillison.net/2025/Mar/2/hallucinations-in-code/
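
(A minimal sketch of the point he's making, using a deliberately made-up method call: a hallucinated API blows up on the very first run, unlike a subtle logic bug.)

    s = "hello"

    # Hallucination-style mistake: Python strings have no .reverse() method,
    # so this fails loudly the first time the code executes.
    try:
        print(s.reverse())
    except AttributeError as e:
        print(f"Reveals itself immediately: {e}")

    # The working equivalent:
    print(s[::-1])  # olleh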

by Chance-Device on 3/1/25, 7:11 PM

It’s not really hallucinating though, is it? It’s repeating a pattern in its training data, which is wrong but is presented in that training data (and by the author of this piece, but unintentionally) as being the solution to the problem. So this has more in common with an attack than a hallucination on the LLM’s part.

by adamgordonbell on 3/1/25, 7:53 PM

We at pulumi started treating some hallucinations like this as feature requests.

Sometimes an LLM will hallucinate a flag or option that really makes sense - it just doesn't actually exist.

by joelthelion on 3/1/25, 9:21 PM

Hallucinations like this could be a great way to identify missing features or confusing parts of your framework. If the LLM invents it, maybe it ought to work that way?

by mberning on 3/1/25, 7:34 PM

In my experience LLMs do this kind of thing with enough frequency that I can't treat them as my primary research tool. I can't afford to be sent down rabbit holes that are barely discernible from reality.

by IAmNotACellist on 3/1/25, 8:23 PM

"Not acceptable. Please upgrade your browser to continue." No, I don't think I will.

by aranw on 3/1/25, 8:33 PM

I wonder how easy it would be to influence the big LLMs. If a particular group of people created enough articles that any human reader would immediately recognise as garbage to be ignored, but an LLM parsing them wouldn't, would that ruin its reasoning and code generation abilities?

by Narretz on 3/1/25, 7:11 PM

This is interesting. If the models had enough actual code as training data, that forum post code should have very little weight, shouldn't it? Why do the LLMs prefer it?

by lxe on 3/1/25, 10:37 PM

This is incredible, and it's not technically a "hallucination". I bet it's relatively easy to find more examples like this... something on the internet that's niche enough, popular enough, and wrong, yet still got scraped and trained on.

by leumon on 3/1/25, 8:32 PM

He should've tested 4.5. That model hallucinates much less than any other model.

by Baggie on 3/1/25, 9:53 PM

The conclusion paragraph was really funny and kinda perfectly encapsulates the current state of AI, but as pointed out by another comment, we can't even call them smart, just "Ctrl C Ctrl V Leeroy Jenkins style"

by jwjohnson314 on 3/1/25, 10:18 PM

The interesting thing here to me is that the llm isn’t ā€˜hallucinating’, it’s simply regurgitating some data it digested during training.

by zeroq on 3/2/25, 2:31 AM

This is exactly what I mean when I say "tell me you're bad without saying so". Most people here disagree with that.

A while back a friend of mine told me he's very fond of LLMs, because he finds the Kubernetes CLI confusing, and instead of looking up the answer on the internet he can simply state what he wants in a chat and get the right command.

Well... sure, but if you looked the answer up on Stack Overflow you'd see the whole thread, including the comments, and you'd have the opportunity to understand what the command actually does.

It's quite easy to create a catastrophic event in kubernetes if you don't know what you're doing.

If you blindly trust llms in such scenarios sooner or later you'll find yourself in a lot of trouble.

by saurik on 3/1/25, 10:41 PM

What I honestly find most interesting about this is the thought that hallucinations might lead to the kind of emergent language design we see in natural language (which might not be a good thing for a computer language, fwiw, but still interesting), where people just kind of think "language should work this way, and if I say it like this people will probably understand me".

by sirolimus on 3/1/25, 6:25 PM

o3-mini or o3-mini-high?

by egberts1 on 3/2/25, 5:45 PM

Write me a Mastercard/Visa fraud detection code in Ada, please.

by forum-soon-yuck on 3/2/25, 10:04 AM

Good luck staking the future on AI