
OpenAI’s o3 offers new ways to scale AI models, but so do the costs


Last month, AI founders and investors told TechCrunch that we are now in a “second era of scaling laws,” noting that established methods of improving AI models were showing diminishing returns. One promising new method they suggested could keep gains coming was “test-time scaling,” which appears to be what is behind the performance of OpenAI’s o3 model, but it comes with drawbacks of its own.

Much of the AI world took the announcement of OpenAI’s o3 model as proof that progress in artificial intelligence has not “hit a wall.” The o3 model significantly outperformed all other models on a test of general ability called ARC-AGI, and it scored 25% on a difficult math test on which no other AI model scored more than 2%.

Of course, we at TechCrunch are taking all of this with a grain of salt until we can try o3 for ourselves (very few have so far). But even before o3’s release, the AI world was already convinced that something big had changed.

Noam Brown, creator of OpenAI’s o-series models, noted on Friday that the startup is announcing o3’s impressive gains just three months after announcing o1 — a relatively short period of time for such a leap in performance.

“We have every reason to believe this trajectory will continue,” Brown said in a tweet.

Anthropic co-founder Jack Clark said in a blog post on Monday that o3 is evidence AI will “progress faster in 2025 than in 2024.” (Keep in mind that it benefits Anthropic, especially its ability to raise capital, to suggest that AI scaling laws are continuing, even if Clark is complimenting a competitor.)

In the coming year, Clark says, the AI world will combine test-time scaling with traditional pre-training scaling methods to get even more gains out of AI models. Perhaps he is suggesting that Anthropic and other AI model providers will release reasoning models of their own in 2025, just as Google did last week.

Test-time scaling means that OpenAI is using more compute during ChatGPT’s inference phase, the period after you press enter on a prompt. It is not entirely clear what is happening behind the scenes: OpenAI is either using more computer chips to answer a user’s question, running more powerful inference chips, or running those chips for longer, 10 to 15 minutes in some cases, before the AI produces an answer. We don’t know all the details of how o3 was created, but these benchmarks are early signs that test-time scaling may work to improve the performance of AI models.
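To make the idea concrete, here is a minimal sketch of one common form of test-time scaling: best-of-N sampling with a scoring step. OpenAI has not disclosed how o3 works, so the generate_answer and score_answer functions below are hypothetical placeholders; the point is only that spending more compute at inference time (a larger n) can buy a better answer.

```python
import random

def generate_answer(prompt: str) -> str:
    # Hypothetical stand-in for sampling one model response.
    return f"candidate answer to '{prompt}' ({random.random():.3f})"

def score_answer(prompt: str, answer: str) -> float:
    # Hypothetical stand-in for a verifier or reward model.
    return random.random()

def best_of_n(prompt: str, n: int) -> str:
    """Spend more inference-time compute (a larger n) to pick a better answer."""
    candidates = [generate_answer(prompt) for _ in range(n)]
    return max(candidates, key=lambda a: score_answer(prompt, a))

# Roughly n times the inference cost of a single response.
print(best_of_n("How many primes are below 100?", n=64))
```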

While o3 may give renewed confidence in the progress of AI scaling laws, OpenAI’s newest model also uses an unprecedented amount of compute, which means a higher price per answer.

“Perhaps the only important caveat here is understanding that one reason why o3 is so much better is that it costs more money to run at inference time: the ability to use test-time compute means that on some problems you can turn compute into a better answer,” Clark writes on his blog. “This is interesting because it has made the costs of running AI systems somewhat less predictable; previously, you could work out how much it cost to serve a generative model just by looking at the model and the cost to generate a given output.”

Clark and others pointed to o3’s performance on the ARC-AGI benchmark, a challenging test used to evaluate progress toward AGI, as an indicator of its advance. It’s worth noting that, according to its creators, passing this test does not mean an AI model has achieved AGI; rather, it is one way to measure progress toward that nebulous goal. That said, the o3 model blew past the scores of all previous AI models that took the test, scoring 88% on one of its attempts. OpenAI’s next best AI model, o1, scored just 32%.

Chart showing the performance of OpenAI’s o-series on the ARC-AGI test. Image credits: ARC Prize

However, the logarithmic x-axis on that chart may be alarming to some. The high-scoring version of o3 used more than $1,000 worth of compute for every task. The o1 models used around $5 of compute per task, and o1-mini used just a few cents.

François Chollet, the creator of the ARC-AGI benchmark, writes in his blog that OpenAI used roughly 170 times more compute to generate that 88% score than the high-efficiency version of o3 did, and that cheaper version scored just 12 percentage points lower. The high-scoring version of o3 used more than $10,000 in resources to complete the test, which makes it too expensive to compete for the ARC Prize, an unbeaten competition for AI models to beat the ARC test.
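As a rough, back-of-the-envelope illustration of the figures above (the per-task price and the ~170x ratio come from this article; the assumption that cost scales roughly linearly with compute is just that, an assumption, and none of these are official OpenAI prices):

```python
# Back-of-the-envelope math based on the figures cited above.
# Assumes cost scales roughly linearly with compute; not official pricing.
high_cost_per_task = 1_000          # dollars: "more than $1,000" per task (high-compute o3)
compute_ratio = 170                 # high-compute run used ~170x the compute of the efficient run
low_cost_per_task = high_cost_per_task / compute_ratio

score_gap_points = 88 - 76          # the two runs scored 88% and roughly 76%

print(f"Implied cost of the high-efficiency run: ~${low_cost_per_task:.2f} per task")
print(f"Extra spend per task to gain ~{score_gap_points} points: ~${high_cost_per_task - low_cost_per_task:.0f}")
```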

However, Chollet says o3 was still a leap forward for AI models.

“o3 is a system capable of adapting to tasks it has never encountered before, arguably approaching human-level performance in the ARC-AGI domain,” Chollet said in the blog post. “Of course, such generality comes at a steep cost, and it wouldn’t quite be economical yet: you could pay a human to solve ARC-AGI tasks for roughly $5 per task (we know, we did that), while consuming mere cents in energy.”

It’s too early to say exactly how much all of this will cost: we’ve seen prices for AI models plummet over the last year, and OpenAI has yet to reveal how much o3 will actually cost. However, these figures show just how much compute is required to break, even slightly, past the performance barriers set by today’s leading AI models.

This raises some questions. What is o3 actually for? And how much more compute will be needed to squeeze out further gains with o4, o5, or whatever OpenAI calls its next reasoning models?

It doesn’t seem like o3, or its successors, would be anyone’s “daily driver” the way GPT-4o or Google Search might be. These models simply use too much compute to answer the small questions you ask throughout the day, such as, “How can the Cleveland Browns still make the 2024 playoffs?”

Instead, AI models that use scaled test-time compute may only be good for big-picture prompts such as, “How can the Cleveland Browns become a Super Bowl franchise in 2027?” Even then, the high compute costs are probably only worth it if you’re the general manager of the Cleveland Browns and you’re using these tools to make some big decisions.

Institutions with deep pockets may be the only ones that can afford o3, at least at first, as Wharton professor Ethan Mollick pointed out in a tweet.

We’ve already seen OpenAI release a $200 tier to use a high-compute version of o1, but the startup has reportedly weighed creating subscription plans costing up to $2,000. When you see how much compute o3 uses, you can understand why OpenAI would consider it.

But there are drawbacks to using o3 for high-impact work. As Chollet points out, o3 is not AGI, and it still fails at some very easy tasks that a human would handle with no trouble.

This isn’t necessarily surprising, as large language models still have a massive hallucination problem, which o3 and test-time compute don’t seem to have solved. That’s why ChatGPT and Gemini include disclaimers below every answer they produce, asking users not to trust answers at face value. Presumably AGI, if it’s ever achieved, would not need such a disclaimer.

One way to unlock more gains in test-time scaling could be better AI inference chips. There’s no shortage of startups tackling exactly this, such as Groq and Cerebras, while other startups, such as MatX, are designing more cost-efficient AI chips. Andreessen Horowitz senior partner Anjney Midha previously told TechCrunch that he expects these startups to play a bigger role in test-time scaling going forward.

While o3 is a notable improvement in the performance of AI models, it raises several new questions around usage and costs. That said, o3’s performance adds credence to the claim that test-time compute is the tech industry’s next best way to scale AI models.






