I haven’t tried Pro yet, but just yesterday I asked o3 to review a file and saw a message in the chain-of-thought like “it’s going to be hard to give a comprehensive answer within the time limit”, so now I’m tempted.
Chat just isn't the best format for something that takes 15-20 minutes (on average) to come up with a response. Email would unironically be better. Send a very long and detailed prompt, like a business email, and get a response back whenever it's ready. Then you can refine the prompt in another email, etc.
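Something in that spirit is already scriptable. Here's a rough sketch of the "send it and check back later" workflow, assuming the OpenAI Responses API's background mode is available for a model named "o3-pro" (the model name, file path, and polling details below are illustrative, not verified):

```python
# "Email-style" workflow: submit a long, detailed prompt without blocking,
# then poll for the answer whenever you feel like checking your inbox.
# Assumes the Responses API's background mode and an "o3-pro" model name.
import time
from pathlib import Path

from openai import OpenAI

client = OpenAI()

# Fire off the request; don't sit through the 15-20 minute wait.
job = client.responses.create(
    model="o3-pro",                                    # assumed model identifier
    input=Path("long_detailed_prompt.txt").read_text(),
    background=True,                                   # queue it instead of blocking
)

# Check back later, like refreshing an inbox.
while True:
    result = client.responses.retrieve(job.id)
    if result.status not in ("queued", "in_progress"):
        break
    time.sleep(60)

print(result.output_text if result.status == "completed" else result.status)
```

Refining the prompt in "another email" would then just be a follow-up request that includes the previous answer.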
But I should note that o3-pro has been getting faster for me lately. At first every damn thing, however simple, took 15+ minutes. Today I got a few answers back within 5 minutes.
I’ve found that throwing the problem at three o3-pros and having another one evaluate and synthesize their answers works really well.
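Something like this rough sketch (the Responses API usage and the "o3-pro" model name are assumptions for illustration, not an exact recipe):

```python
# Fan-out/synthesize: three independent o3-pro attempts, then a fourth call
# that evaluates and merges them. Model name and API details are illustrative.
from openai import OpenAI

client = OpenAI()
PROBLEM = "State the problem here, with all relevant code and context."

def ask(prompt: str) -> str:
    resp = client.responses.create(model="o3-pro", input=prompt)  # assumed model name
    return resp.output_text

# Three independent attempts (sequential here; run them in parallel if latency matters).
drafts = [ask(PROBLEM) for _ in range(3)]

# A fourth call judges and synthesizes the drafts.
synthesis_prompt = (
    "Three experts answered the same problem independently.\n\n"
    + "\n\n".join(f"--- Answer {i + 1} ---\n{d}" for i, d in enumerate(drafts))
    + "\n\nNote where they agree and disagree, then synthesize the single best answer."
)
print(ask(synthesis_prompt))
```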
when o3 pricing dropped 80%, most wrote the entire model family off as a downgrade (including me). but usage patterns flipped: people finally ran real tasks through it. it's one of the few models that holds state across fragmented prompts without collapsing context. used it to audit a messy auth flow spread over 6 services. it didn't shortcut, didn't hallucinate edge cases. slow, but deliberate. in kahneman's terms, it runs system 2 by default. many still benchmark on token speed, missing what actually matters.
> Arena has gotten quite silly if treated as a comprehensive measure (as in Gemini 2.5 Flash is rated above o3)
> The problem with o3-pro is that it is slow.
well maybe Arena is not that silly then. poorly argued/organized article.
What is the difference between o3-pro and Deep Research? At a glance, both seem to take 10-15 minutes to respond and both use o3 as the base model.
I've tried o3 Pro for my use cases (parsing emails in the legal profession) and didn't get better results than with the non-pro model.
In fact, o1-preview has given me more consistently correct results than any other model. But it's being sunset next month so I have to move to o3.
I use Claude Code a lot. A lot lot. I make it do atomic Git commits for me. When it gets stuck and, instead of just saying so, starts to refactor half of the codebase, I jump back to the commit where the issue first appeared and get a summary of the involved files. Those go in full text (not as files) into o3-pro. And you can be sure it finds the issue or at least gives a direction where the issue does not appear. I would love o3-pro as an MCP server, so whenever Claude Code goes on a "let's refactor everything" coding spree it just asks o3-pro.
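The MCP version could look roughly like this; the tool name, prompt, and wiring are a sketch assuming the official `mcp` Python SDK (FastMCP) and the OpenAI Responses API with an "o3-pro" model, not something I have running:

```python
# Sketch of the wished-for MCP server: one tool Claude Code could call to get a
# second opinion from o3-pro before it starts refactoring half the codebase.
# Assumes the official `mcp` Python SDK (FastMCP) and an "o3-pro" Responses model.
from mcp.server.fastmcp import FastMCP
from openai import OpenAI

mcp = FastMCP("o3-pro-second-opinion")
client = OpenAI()

@mcp.tool()
def ask_o3_pro(problem_summary: str, files_full_text: str) -> str:
    """Send a problem summary plus the full text of the involved files to o3-pro
    and return its diagnosis, or at least a direction where the issue is not."""
    resp = client.responses.create(
        model="o3-pro",  # assumed model identifier
        input=(
            "You are reviewing a stuck debugging session.\n\n"
            f"Summary of the problem:\n{problem_summary}\n\n"
            f"Full text of the involved files:\n{files_full_text}\n\n"
            "Find the root cause, or narrow down where it cannot be."
        ),
    )
    return resp.output_text

if __name__ == "__main__":
    mcp.run()  # stdio transport by default; register the server in Claude Code's MCP config
```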
I use o3-pro not as a coding model but as a strategic assistant. For me, the long delay between responses makes the model unsuitable for coding workflows; however, it is actually a feature when it comes to getting answers to hard questions affecting my (or my friends'/family's) day-to-day life.
"'take your profits’ in quality versus quantity is up to you."
As mainly an AI investor rather than an AI user, I think profitability is of great importance. It has been a race to the top so far; soon we will see a race to the bottom.
Here are my own anecdotes from using o3-pro recently.
My primary use case where I am willing to wait 10-20 minutes for an answer from the "big slow" model (o3-pro) is code review of large amounts of code. I have been comparing results on this task from three models: o3-pro, Gemini 2.5 Pro, and Claude Opus 4 (with extended thinking).
Oddly, I see many cases where each model will surface issues that the other two miss. In previous months when running this test (e.g., Claude 3.7 Sonnet vs. o1-pro vs. an earlier Gemini), that wasn't the case. Back then, the best model (o1-pro) would almost always find all the issues that the other models found. But now it seems they each have their own blind spots (although they are also all better than the previous generation of models).
With that said, I am seeing Claude Opus 4 (with extended thinking) be distinctly worse, missing problems that o3-pro and Gemini find. It seems fairly consistent that Opus is the worst of the three (despite sometimes noticing things the others do not).
Whether o3-pro or Gemini 2.5 Pro is better is less clear. o3-pro will report more issues, but it also has a tendency to confabulate problems. My workflow involves providing the model with a diff of all changes, plus the full contents of the files that were changed. o3-pro seems to have a tendency to imagine and report problems in the files that were not provided to it. It also has an odd new failure mode, which is very consistent: it gets confused by the fact that I provide both the diff and the full file contents. It "sees" parts of the same code twice and will usually report that there has accidentally been some code duplicated. Base o3 does this as well. None of the other models get confused in that way, and I also do not remember seeing that failure mode with o1-pro.
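For reference, a minimal sketch of how that diff-plus-full-files prompt can be assembled (the branch name, paths, and section labels are illustrative):

```python
# Build a code-review prompt from a git diff plus the full current contents of
# every changed file. Branch name, paths, and section labels are illustrative.
import subprocess
from pathlib import Path

def git(*args: str) -> str:
    return subprocess.run(["git", *args], capture_output=True, text=True, check=True).stdout

diff = git("diff", "main...HEAD")
changed_files = git("diff", "--name-only", "main...HEAD").splitlines()

sections = ["## DIFF OF CHANGES (the same code also appears in the full files below)\n" + diff]
for path in changed_files:
    p = Path(path)
    if not p.exists():  # skip files deleted in the diff
        continue
    sections.append(f"## FULL CURRENT CONTENTS OF {path}\n{p.read_text(encoding='utf-8')}")

prompt = (
    "Review these changes for bugs, regressions, and design problems. "
    "The diff and the full files intentionally overlap; do not report that overlap as duplication.\n\n"
    + "\n\n".join(sections)
)
```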
Nevertheless, o3-pro seems to find real issues that Gemini 2.5 Pro and Opus 4 miss more often than the reverse happens.
Back in the o1-pro days, my testing for this use case made it fairly clear that o1-pro was simply better across the board. Now, comparing o3-pro with Gemini 2.5 Pro in particular, it's no longer clear whether the bonus of occasionally finding a problem that Gemini misses is worth the trouble of (1) waiting far longer for an answer and (2) sifting through more false positives.
My other common code-related use case is actually writing code. Here, Claude Code (with Opus 4) is amazing and has replaced all my other uses of coding models, including Cursor. I now code almost exclusively by pair programming with Claude Code, letting it write the code while I oversee and review. OpenAI's competitor to Claude Code, Codex CLI, feels distinctly undercooked. It has a recurring problem where it seems to "forget" that it is an agent that needs to go ahead and edit files, and it will instead start offering me suggestions about how I can make the change myself. It also regularly hallucinates running commands (e.g., I tell it to commit the changes we've made, it outputs that it has done so, but it has not).
So where will I spend my $200 monthly model budget? Answer: Claude, for nearly unlimited use of Claude Code. For highly complex tasks, I switch to Gemini 2.5 Pro, which is still free in AI Studio. If I can wait 10+ minutes, I may hand it to o3-pro. But once my ChatGPT Pro subscription expires this month, I may either stop using o3-pro altogether, or I may occasionally use it as a second opinion by paying on-demand through the API.
I feel really sorry for anyone using o3. It is really, really bad...
> My experience so far is that waiting a long time is annoying, sufficiently annoying that you often won’t want to wait.
My solution for this has been to use non-reasoning models, and so far, in 90% of situations, I have received the exact same results from both kinds of model.
I'm using Pro. It's definitely a "hand it to the team and have them schedule a meeting to get back to me" speed tool. But, it "feels" better to me than o3, and significantly better than gemini/claude for that use case. I do trust it more on confabulations; my current trust hierarchy would be o3-pro -> o3 -> gemini -> claude opus -> (a bunch of stuff) -> 4o.
That said, I'd like this quality from a relatively quick tool-using model; I'm not sure what more I'd want before calling it "AGI" at that point.